Toward Moderate Overparameterization: Global Convergence Guarantees for Training Shallow Neural Networks

@article{Oymak2020TowardMO,
  title={Toward Moderate Overparameterization: Global Convergence Guarantees for Training Shallow Neural Networks},
  author={Samet Oymak and Mahdi Soltanolkotabi},
  journal={IEEE Journal on Selected Areas in Information Theory},
  year={2020},
  volume={1},
  pages={84-105}
}
Many modern neural network architectures are trained in an overparameterized regime where the parameters of the model exceed the size of the training dataset. Sufficiently overparameterized neural network architectures in principle have the capacity to fit any set of labels, including random noise. However, given the highly nonconvex nature of the training landscape, it is not clear what level and kind of overparameterization is required for first-order methods to converge to a global optimum that…
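As a rough, self-contained illustration of the regime the abstract describes (not the paper's actual construction or proofs), the sketch below trains a one-hidden-layer ReLU network whose parameter count exceeds the number of training points on purely random labels, using full-batch gradient descent on the hidden-layer weights with the output layer held fixed; the width, step size, and data sizes are arbitrary choices made for the example.

```python
# Illustrative sketch only: an overparameterized one-hidden-layer ReLU network
# (m * d = 40,000 trainable parameters, n = 40 samples) fit to random labels by
# full-batch gradient descent. Sizes and the step size are arbitrary choices.
import numpy as np

rng = np.random.default_rng(0)
n, d, m = 40, 20, 2000                        # samples, input dim, hidden width
X = rng.standard_normal((n, d)) / np.sqrt(d)  # roughly unit-norm inputs
y = rng.standard_normal(n)                    # random labels ("pure noise")

W = rng.standard_normal((m, d))               # hidden weights (trained)
a = rng.choice([-1.0, 1.0], m) / np.sqrt(m)   # fixed output weights

def forward(W):
    return np.maximum(X @ W.T, 0.0) @ a       # network outputs, shape (n,)

eta = 0.25
for t in range(3001):
    resid = forward(W) - y
    act = (X @ W.T > 0).astype(float)         # ReLU derivative, shape (n, m)
    # Gradient of 0.5 * ||f(X) - y||^2 with respect to W
    grad_W = (act * (resid[:, None] * a[None, :])).T @ X
    W -= eta * grad_W
    if t % 1000 == 0:
        print(f"iter {t:5d}  loss {0.5 * np.sum(resid**2):.6f}")
```

Despite the labels carrying no signal, the training loss is driven toward zero once the width is large enough, which is the interpolation behavior whose convergence the abstract is concerned with.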
Citations

On the Convergence of Deep Networks with Sample Quadratic Overparameterization
TLDR: A tight finite-width Neural Tangent Kernel (NTK) equivalence is derived, suggesting that neural networks trained with this technique generalize at least as well as their NTK, and that the equivalence can be used to study generalization.
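As background for the finite-width NTK mentioned in the entry above, here is a small illustrative sketch (my own, not taken from either paper): the empirical NTK at initialization is simply the Gram matrix of per-example parameter gradients of the network output, and its smallest eigenvalue is the quantity that typically governs convergence rates in analyses of this kind. The architecture, scaling, and sizes below are arbitrary.

```python
# Illustrative sketch: the empirical (finite-width) NTK at initialization is
# the Gram matrix K[i, j] = <grad_theta f(x_i), grad_theta f(x_j)> of
# per-example parameter gradients. Network: f(x) = sum_j a_j relu(w_j . x) / sqrt(m).
import numpy as np

rng = np.random.default_rng(1)
n, d, m = 8, 5, 512                           # samples, input dim, hidden width
X = rng.standard_normal((n, d))
W = rng.standard_normal((m, d))               # hidden weights at initialization
a = rng.choice([-1.0, 1.0], m)                # output weights at initialization

pre = X @ W.T                                 # pre-activations, shape (n, m)
act = (pre > 0).astype(float)                 # ReLU derivative

# d f(x) / d w_j = a_j * 1[w_j . x > 0] * x / sqrt(m)
J_W = (act * a[None, :])[:, :, None] * X[:, None, :] / np.sqrt(m)  # (n, m, d)
# d f(x) / d a_j = relu(w_j . x) / sqrt(m)
J_a = np.maximum(pre, 0.0) / np.sqrt(m)                            # (n, m)

J = np.concatenate([J_W.reshape(n, -1), J_a], axis=1)  # per-example gradients
K = J @ J.T                                            # empirical NTK Gram matrix

print("NTK Gram matrix shape:", K.shape)
print("smallest eigenvalue:", np.linalg.eigvalsh(K)[0])
```

An NTK equivalence statement of the kind summarized above says, roughly, that at sufficient width training the network behaves like kernel regression with this Gram matrix.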
Gradient Descent with Early Stopping is Provably Robust to Label Noise for Overparameterized Neural Networks
TLDR: Under a rich dataset model, it is shown that gradient descent is provably robust to noise/corruption on a constant fraction of the labels despite overparameterization, which sheds light on the empirical robustness of deep networks as well as on commonly adopted heuristics to prevent overfitting.
Nearly Minimal Over-Parametrization of Shallow Neural Networks
TLDR: It is established that linear overparametrization is sufficient to fit the training data using a simple variant of (stochastic) gradient descent.
Benefits of Jointly Training Autoencoders: An Improved Neural Tangent Kernel Analysis
TLDR: This paper rigorously proves the linear convergence of gradient descent in two regimes, weakly-trained and jointly-trained, and indicates the considerable benefits of joint training over weak training in finding global optima, with a dramatic decrease in the required level of over-parameterization.
An Improved Analysis of Training Over-parameterized Deep Neural Networks
TLDR: An improved analysis of the global convergence of (stochastic) gradient descent for training deep neural networks is provided, requiring a milder over-parameterization condition than previous work in terms of the training sample size and other problem-dependent parameters.
Provable Benefits of Overparameterization in Model Compression: From Double Descent to Pruning Neural Networks
TLDR: The theory presented addresses the following core question: "should one train a small model from the beginning, or first train a large model and then prune?", and analytically identifies regimes in which, even if the location of the most informative features is known, one is better off fitting a large model and then pruning rather than simply training with the known informative features.
The Generalization Error of Random Features Regression: Precise Asymptotics and the Double Descent Curve
Deep learning methods operate in regimes that defy the traditional statistical mindset. The neural network architectures often contain more parameters than training samples, and are so rich that they…
Generalization Bounds of Stochastic Gradient Descent for Wide and Deep Neural Networks
TLDR: The expected 0-1 loss of a wide enough ReLU network trained with stochastic gradient descent and random initialization can be bounded by the training loss of a random feature model induced by the network gradient at initialization, which is called a neural tangent random feature (NTRF) model.
Nonparametric Regression with Shallow Overparameterized Neural Networks Trained by GD with Early Stopping
TLDR: This work explores the ability of overparameterized shallow neural networks to learn Lipschitz regression functions, with and without label noise, when trained by gradient descent, and proposes an early stopping rule under which optimal rates are shown.
Overparameterized Nonlinear Optimization with Applications to Neural Nets
  • Samet Oymak
  • Computer Science
  • 2019 13th International Conference on Sampling Theory and Applications (SampTA)
  • 2019
TLDR: This talk shows that the solution found by first-order methods such as gradient descent has nearly the shortest distance to the algorithm's initialization among all solutions, and advocates that this shortest-distance property can be a good proxy for the simplest explanation.
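The shortest-distance property in the entry above is easiest to see in a linear model: for an underdetermined least-squares problem, gradient descent started from w0 converges to the interpolating solution closest to w0, namely w0 + A^+(y - A w0). The sketch below (an illustration under this linear simplification, not the talk's neural-network setting) checks this numerically.

```python
# Illustrative sketch (linear case): for an underdetermined, consistent system
# A w = y, gradient descent on 0.5 * ||A w - y||^2 started at w0 converges to
# the solution closest to w0, i.e. w0 + pinv(A) @ (y - A @ w0).
import numpy as np

rng = np.random.default_rng(2)
n, p = 20, 100                                 # fewer equations than unknowns
A = rng.standard_normal((n, p))
y = rng.standard_normal(n)
w0 = rng.standard_normal(p)

w = w0.copy()
eta = 1.0 / np.linalg.norm(A, 2) ** 2          # step size below 2 / lambda_max
for _ in range(500):
    w -= eta * A.T @ (A @ w - y)               # gradient step on the squared loss

closest = w0 + np.linalg.pinv(A) @ (y - A @ w0)  # nearest interpolant to w0

print("residual ||A w - y||:", np.linalg.norm(A @ w - y))
print("gap to closed form  :", np.linalg.norm(w - closest))
```

Transferring this intuition to nonlinear models that stay close to their linearization during training is, roughly, what motivates treating distance to initialization as a proxy for the simplicity of the solution found.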

References

Showing 1-10 of 50 references
A Convergence Theory for Deep Learning via Over-Parameterization
TLDR: This work proves that stochastic gradient descent can find global minima of the training objective of DNNs in polynomial time, and implies an equivalence between over-parameterized neural networks and the neural tangent kernel (NTK) in the finite (and polynomial) width setting.
SGD Learns Over-parameterized Networks that Provably Generalize on Linearly Separable Data
TLDR: This work proves convergence rates of SGD to a global minimum and provides generalization guarantees for this global minimum that are independent of the network size, and shows that SGD can avoid overfitting despite the high capacity of the model.
An Improved Analysis of Training Over-parameterized Deep Neural Networks
TLDR: An improved analysis of the global convergence of (stochastic) gradient descent for training deep neural networks is provided, requiring a milder over-parameterization condition than previous work in terms of the training sample size and other problem-dependent parameters.
Theoretical Insights Into the Optimization Landscape of Over-Parameterized Shallow Neural Networks
TLDR: It is shown that, with quadratic activations, the optimization landscape of training such shallow neural networks has certain favorable characteristics that allow globally optimal models to be found efficiently using a variety of local search heuristics.
Gradient Descent Provably Optimizes Over-parameterized Neural Networks
TLDR: Over-parameterization and random initialization jointly restrict every weight vector to remain close to its initialization for all iterations, which enables a strong-convexity-like property showing that gradient descent converges at a global linear rate to the global optimum.
Overparameterized Nonlinear Learning: Gradient Descent Takes the Shortest Path?
TLDR: This paper demonstrates the utility of the general theory of (stochastic) gradient descent for a variety of problem domains, ranging from low-rank matrix recovery to neural network training, and develops novel martingale techniques that guarantee SGD never leaves a small neighborhood of the initialization, even with rather large learning rates.
Learning and Generalization in Overparameterized Neural Networks, Going Beyond Two Layers
TLDR: It is proved that overparameterized neural networks trained with SGD (stochastic gradient descent) or its variants can learn some notable concept classes, including two- and three-layer networks with fewer parameters and smooth activations, in polynomial time using polynomially many samples.
Learning Overparameterized Neural Networks via Stochastic Gradient Descent on Structured Data
TLDR: It is proved that SGD learns a network with a small generalization error, even though the network has enough capacity to fit arbitrary labels, when the data comes from mixtures of well-separated distributions.
Generalization Guarantees for Neural Networks via Harnessing the Low-rank Structure of the Jacobian
TLDR: A data-dependent optimization and generalization theory is developed which leverages the low-rank structure of the Jacobian matrix associated with the network and shows that even constant-width neural nets can provably generalize on sufficiently nice datasets.
Local Geometry of One-Hidden-Layer Neural Networks for Logistic Regression
TLDR: This work proves that, under Gaussian input, the empirical risk function employing quadratic loss exhibits strong convexity and smoothness uniformly in a local neighborhood of the ground truth for a class of smooth activation functions satisfying certain properties, including sigmoid and tanh, as soon as the sample size is sufficiently large.