• Corpus ID: 22226660

Weight Sharing is Crucial to Succesful Optimization

Shai Shalev-Shwartz, Ohad Shamir, Shaked Shammah
Exploiting the great expressive power of Deep Neural Network architectures relies on the ability to train them. While current theoretical work provides, mostly, results showing the hardness of this task, empirical evidence usually differs from this line, with success stories in abundance. A strong position among empirically successful architectures is captured by networks where extensive weight sharing is used, either by Convolutional or Recurrent layers. Additionally, characterizing specific…
Learning Activation Functions: A new paradigm of understanding Neural Networks
It is shown that using SLAF along with standard activations can provide performance improvements with only a small increase in the number of parameters, and it is proved that SLNNs can approximate any neural network with Lipschitz continuous activations to arbitrary error, highlighting their capacity and possible equivalence with standard NNs.
ResNet with one-neuron hidden layers is a Universal Approximator
We demonstrate that a very deep ResNet with stacked modules with one neuron per hidden layer and ReLU activation functions can uniformly approximate any Lebesgue integrable function in $d$
When is a Convolutional Filter Easy To Learn?
It is shown that (stochastic) gradient descent with random initialization can learn the convolutional filter in polynomial time and the convergence rate depends on the smoothness of the input distribution and the closeness of patches.
Learning One-hidden-layer ReLU Networks via Gradient Descent
It is proved that tensor initialization followed by gradient descent can converge to the ground-truth parameters at a linear rate up to some statistical error.
Background: Statistical mechanics results (Dauphin et al. (2014); Choromanska et al. (2015)) suggest that local minima with high error are exponentially rare in high dimensions. However, to prove low
Computational Separation Between Convolutional and Fully-Connected Networks
A class of problems is shown that can be solved efficiently using convolutional networks trained with gradient descent, but is hard to learn using a polynomial-size fully-connected network.
Exponentially vanishing sub-optimal local minima in multilayer neural networks
It is proved that, with high probability in the limit of $N\rightarrow\infty$ datapoints, the volume of differentiable regions of the empirical loss containing sub-optimal differentiable local minima is exponentially vanishing in comparison with the same volume of global minima.
The Convergence Rate of Neural Networks for Learned Functions of Different Frequencies
It is shown theoretically and experimentally that a shallow neural network without bias cannot represent or learn simple, low-frequency functions with odd frequencies, and the analysis leads to specific predictions of the time it will take a network to learn functions of varying frequency.
Gradient Descent Learns One-hidden-layer CNN: Don't be Afraid of Spurious Local Minima
We consider the problem of learning a one-hidden-layer neural network with non-overlapping convolutional layer and ReLU activation, i.e., $f(\mathbf{Z}, \mathbf{w}, \mathbf{a}) = \sum_j
Gradient Descent for Non-convex Problems in Modern Machine Learning
  • S. Du
  • Computer Science
  • 2019
Geometric contributions to fill the gap between theory and practice on the gradient descent algorithm are presented, and it is shown that gradient descent can take exponential time to optimize a smooth function with the strict saddle point property, for which noise-injected gradient descent can optimize in polynomial time.


Identity Matters in Deep Learning
This work gives a strikingly simple proof that arbitrarily deep linear residual networks have no spurious local optima and shows that residual networks with ReLU activations have universal finite-sample expressivity, in the sense that the network can represent any function of its sample provided that the model has more parameters than the sample size.
On the Quality of the Initial Basin in Overspecified Neural Networks
This work studies the geometric structure of the associated non-convex objective function, in the context of ReLU networks and starting from a random initialization of the network parameters, and identifies some conditions under which it becomes more favorable to optimization.
Distribution-Specific Hardness of Learning Neural Networks
  • O. Shamir
  • Computer Science, Mathematics
    J. Mach. Learn. Res.
  • 2018
This paper identifies a family of simple target functions that are difficult to learn even if the input distribution is "nice", and provides evidence that neither assumptions on the input distribution alone nor assumptions on the target function alone are sufficient.
Failures of Deep Learning
This work describes four families of problems for which some of the commonly used existing algorithms fail or suffer significant difficulty, and provides theoretical insights explaining their source, and how they might be remedied.
Globally Optimal Gradient Descent for a ConvNet with Gaussian Inputs
This work provides the first global optimality guarantee of gradient descent on a convolutional neural network with ReLU activations, and shows that learning is NP-complete in the general case, but that when the input distribution is Gaussian, gradient descent converges to the global optimum in polynomial time.
Failures of Gradient-Based Deep Learning
This work describes four types of simple problems, for which the gradient-based algorithms commonly used in deep learning either fail or suffer from significant difficulties.
Global Optimality in Tensor Factorization, Deep Learning, and Beyond
This framework derives sufficient conditions to guarantee that a local minimum of the non-convex optimization problem is a global minimum and shows that if the size of the factorized variables is large enough then from any initialization it is possible to find a global minimizer using a purely local descent algorithm.
Provable Bounds for Learning Some Deep Representations
This work gives algorithms with provable guarantees that learn a class of deep nets in the generative model view popularized by Hinton and others, based upon a novel idea of observing correlations among features and using these to infer the underlying edge structure via a global graph recovery procedure.
Beating the Perils of Non-Convexity: Guaranteed Training of Neural Networks using Tensor Methods
This work proposes a computationally efficient method with guaranteed risk bounds for training neural networks with one hidden layer based on tensor decomposition, which provably converges to the global optimum, under a set of mild non-degeneracy conditions.
On the Computational Efficiency of Training Neural Networks
This paper revisits the computational complexity of training neural networks from a modern perspective and provides both positive and negative results, some of which yield new provably efficient and practical algorithms for training certain types of neural networks.