Corpus ID: 53250107

A Convergence Theory for Deep Learning via Over-Parameterization

@article{AllenZhu2019ACT,
  title={A Convergence Theory for Deep Learning via Over-Parameterization},
  author={Zeyuan Allen-Zhu and Yuanzhi Li and Zhao Song},
  journal={ArXiv},
  year={2019},
  volume={abs/1811.03962}
}
Deep neural networks (DNNs) have demonstrated dominating performance in many fields; since AlexNet, networks used in practice are going wider and deeper. […] Key Result: In terms of network architectures, our theory at least applies to fully-connected neural networks, convolutional neural networks (CNN), and residual neural networks (ResNet).
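
The regime the paper analyzes can be pictured with a small training script: a deliberately wide fully-connected ReLU network with random Gaussian initialization, trained by plain full-batch gradient descent on the squared loss. The sketch below is illustrative only and is not the authors' code; the width m, the learning rate, and the iteration count are placeholders standing in for the polynomial-in-$n$-and-$L$ choices the theory prescribes.

# Illustrative sketch (not the authors' code): full-batch gradient descent on an
# over-parameterized fully-connected ReLU network with squared loss.
# Width m, learning rate, and iteration count are placeholders; the theory asks
# for m polynomial in the sample size n and the depth L.
import torch

n, d, L, m = 100, 10, 3, 1024                                 # samples, input dim, depth, width
X = torch.nn.functional.normalize(torch.randn(n, d), dim=1)   # unit-norm inputs
y = torch.randn(n, 1)

layers, fan_in = [], d
for _ in range(L):
    lin = torch.nn.Linear(fan_in, m, bias=False)
    torch.nn.init.normal_(lin.weight, std=(2.0 / fan_in) ** 0.5)  # He-style Gaussian init
    layers += [lin, torch.nn.ReLU()]
    fan_in = m
layers.append(torch.nn.Linear(m, 1, bias=False))
net = torch.nn.Sequential(*layers)

opt = torch.optim.SGD(net.parameters(), lr=1e-4)  # full batch, so this is plain GD
for step in range(1000):
    opt.zero_grad()
    loss = 0.5 * ((net(X) - y) ** 2).sum()
    loss.backward()
    opt.step()
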

Citations

How Much Over-parameterization Is Sufficient to Learn Deep ReLU Networks?
TLDR
This work establishes sharp optimization and generalization guarantees for deep ReLU networks under various assumptions made in previous work, and shows that a network width polylogarithmic in $n$ and $\epsilon^{-1}$ suffices.
A Generalization Theory of Gradient Descent for Learning Over-parameterized Deep ReLU Networks
TLDR
It is proved that under certain assumption on the data distribution that is milder than linear separability, gradient descent with proper random initialization is able to train a sufficiently over-parameterized DNN to achieve arbitrarily small expected error, leading to an algorithmic-dependent generalization error bound for deep learning.
On the Convergence of Deep Networks with Sample Quadratic Overparameterization
TLDR
A tight finite-width Neural Tangent Kernel (NTK) equivalence is derived, suggesting that neural networks trained with this technique generalize at least as well as their NTK counterpart, and that the equivalence can be used to study generalization as well.
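
For context (an illustration, not part of the cited TLDR), the neural tangent kernel referred to here is the inner product of the network's parameter gradients, evaluated at the random initialization $\theta_0$; in the infinite-width limit this kernel remains essentially fixed throughout training:

\[ \Theta(x, x') \;=\; \big\langle \nabla_\theta f_\theta(x),\, \nabla_\theta f_\theta(x') \big\rangle \big|_{\theta = \theta_0}. \]
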
An Improved Analysis of Training Over-parameterized Deep Neural Networks
TLDR
An improved analysis of the global convergence of (stochastic) gradient descent for training deep neural networks, which only requires a milder over-parameterization condition than previous work in terms of the training sample size and other problem-dependent parameters is provided.
How Much Over-parameterization Is Sufficient to Learn Deep ReLU Networks?
TLDR
The question of whether deep neural networks can be learned with such mild over-parameterization is answered affirmatively, and sharper learning guarantees are established for deep ReLU networks trained by (stochastic) gradient descent.
Generalization Error Bounds of Gradient Descent for Learning Over-Parameterized Deep ReLU Networks
TLDR
An algorithm-dependent generalization error bound for deep ReLU networks is derived, and it is shown that, under certain assumptions on the data distribution, gradient descent with proper random initialization is able to train a sufficiently over-parameterized DNN to achieve arbitrarily small generalization error.
Convergence Analysis of Training Two-Hidden-Layer Partially Over-Parameterized ReLU Networks via Gradient Descent
TLDR
This work provides a probabilistic lower bound on the widths of the hidden layers and proves a linear convergence rate for gradient descent; it is the first theoretical work to understand convergence properties of deep over-parameterized networks without the equally-wide-hidden-layer assumption and other unrealistic assumptions.
Training Over-parameterized Deep ResNet Is almost as Easy as Training a Two-layer Network
TLDR
This paper removes the dependence of the width on the depth of the network for ResNet, concluding that training a deep residual network can be as easy as training a two-layer network, and theoretically justifies the benefit of skip connections in facilitating the convergence of gradient descent.
A Mean-field Analysis of Deep ResNet and Beyond: Towards Provable Optimization Via Overparameterization From Depth
TLDR
A mean-field analysis of deep residual networks is given, building on a line of works that interpret the continuum limit of a deep residual network as an ordinary differential equation as the network capacity tends to infinity; a new continuum limit is proposed that enjoys a good landscape in the sense that every local minimizer is global.
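
Schematically (an illustration with generic notation, not the cited paper's exact formulation), the continuum limit replaces the residual recursion by an ODE in the depth variable as the number of layers $L \to \infty$:

\[ x_{\ell+1} = x_\ell + \tfrac{1}{L}\, h(x_\ell, \theta_\ell) \;\;\longrightarrow\;\; \frac{dx(t)}{dt} = h\big(x(t), \theta(t)\big), \quad t \in [0, 1]. \]
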
Generalization Bounds of Stochastic Gradient Descent for Wide and Deep Neural Networks
TLDR
The expected $0$-$1$ loss of a wide enough ReLU network trained with stochastic gradient descent and random initialization can be bounded by the training loss of a random feature model induced by the network gradient at initialization, which is called a neural tangent random feature (NTRF) model.
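
Roughly (a sketch of the idea, not the cited paper's exact statement), the NTRF model is the network's first-order Taylor expansion around the random initialization $W^{(0)}$, i.e. a linear model in the features $\nabla_W f_{W^{(0)}}(x)$:

\[ F_{W^{(0)}, W}(x) \;=\; f_{W^{(0)}}(x) + \big\langle \nabla_W f_{W^{(0)}}(x),\, W - W^{(0)} \big\rangle. \]
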
…

References

Showing 1–10 of 72 references
Gradient Descent Provably Optimizes Over-parameterized Neural Networks
TLDR
Over-parameterization and random initialization jointly restrict every weight vector to be close to its initialization for all iterations, which allows a strong convexity-like property to show that gradient descent converges at a global linear rate to the global optimum.
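
In analyses of this kind (sketched here with illustrative notation, not quoted from the cited paper), the predictions $u(k)$ on the $n$ training points contract toward the labels $y$ at a rate governed by the smallest eigenvalue $\lambda_0$ of the limiting Gram (NTK) matrix, provided the width is large enough and the step size $\eta$ is small enough:

\[ \|u(k) - y\|_2^2 \;\le\; \Big(1 - \frac{\eta \lambda_0}{2}\Big)^{k} \|u(0) - y\|_2^2. \]
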
Convergence Analysis of Two-layer Neural Networks with ReLU Activation
TLDR
A convergence analysis of SGD is provided on a rich subset of two-layer feedforward networks with ReLU activations, characterized by a special structure called "identity mapping"; it is proved that, if the input follows a Gaussian distribution, then with standard $O(1/\sqrt{d})$ initialization of the weights, SGD converges to the global minimum in a polynomial number of steps.
On the Convergence Rate of Training Recurrent Neural Networks
TLDR
It is shown that when the number of neurons is sufficiently large, meaning polynomial in the training data size, SGD is capable of minimizing the regression loss at a linear convergence rate; this gives theoretical evidence of how RNNs can memorize data.
A Convergence Analysis of Gradient Descent for Deep Linear Neural Networks
TLDR
The speed of convergence to a global optimum for gradient descent training a deep linear neural network by minimizing the $\ell_2$ loss over whitened data is analyzed, under the condition that the initial loss is smaller than the loss of any rank-deficient solution.
Learning and Generalization in Overparameterized Neural Networks, Going Beyond Two Layers
TLDR
It is proved that overparameterized neural networks can learn some notable concept classes, including two- and three-layer networks with fewer parameters and smooth activations, using SGD (stochastic gradient descent) or its variants in polynomial time with polynomially many samples.
Recovery Guarantees for One-hidden-layer Neural Networks
TLDR
This work distills some properties of activation functions that lead to local strong convexity in the neighborhood of the ground-truth parameters for the 1NN squared-loss objective, and provides recovery guarantees for 1NNs with both sample complexity and computational complexity $\mathit{linear}$ in the input dimension and $\mathit{logarithmic}$ in the precision.
Identity Matters in Deep Learning
TLDR
This work gives a strikingly simple proof that arbitrarily deep linear residual networks have no spurious local optima, and shows that residual networks with ReLU activations have universal finite-sample expressivity, in the sense that the network can represent any function of its sample provided that the model has more parameters than the sample size.
On the Complexity of Learning Neural Networks
TLDR
A comprehensive lower bound is demonstrated, ruling out the possibility that data generated by neural networks with a single hidden layer, smooth activation functions, and benign input distributions can be learned efficiently; the lower bound is robust to small perturbations of the true weights.
Learning Overparameterized Neural Networks via Stochastic Gradient Descent on Structured Data
TLDR
It is proved that, when the data come from mixtures of well-separated distributions, SGD learns a network with small generalization error even though the network has enough capacity to fit arbitrary labels.
Learning Non-overlapping Convolutional Neural Networks with Multiple Kernels
TLDR
This is the first work to provide recovery guarantees for CNNs with multiple kernels under polynomial sample and computational complexity, and it shows that tensor methods are able to initialize the parameters in the locally strongly convex region.
…