• Corpus ID: 233237300

One-pass Stochastic Gradient Descent in overparametrized two-layer neural networks

Hanjing Zhu, Jiaming Xu
There has been a recent surge of interest in understanding the convergence of gradient descent (GD) and stochastic gradient descent (SGD) in overparameterized neural networks. Most previous work assumes that the training data is provided a priori in a batch, while less attention has been paid to the important setting where the training data arrives in a stream. In this paper, we study the streaming data setup and show that with overparameterization and random initialization, the prediction error…
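The streaming setup described above can be illustrated with a minimal sketch. It is not the authors' algorithm or scaling. It assumes a two-layer ReLU network with fixed random output signs, a 1/sqrt(m) output scaling, squared loss, and a hypothetical linear target on the unit sphere. The names `one_pass_sgd`, `predict`, and `make_stream` are illustrative only. Each sample is drawn fresh, used for exactly one gradient step, and then discarded.

```python
import numpy as np

def one_pass_sgd(stream, d, m=512, lr=0.5, seed=0):
    """One SGD step per fresh sample on the hidden weights of
    f(x) = (1/sqrt(m)) * sum_j a_j * relu(w_j . x)."""
    rng = np.random.default_rng(seed)
    W = rng.standard_normal((m, d))        # random Gaussian initialization
    a = rng.choice([-1.0, 1.0], size=m)    # fixed output signs (assumption)
    W0 = W.copy()                          # kept to compare against the trained weights
    for x, y in stream:                    # every (x, y) is seen exactly once
        pre = W @ x                        # pre-activations, shape (m,)
        pred = a @ np.maximum(pre, 0.0) / np.sqrt(m)
        err = pred - y
        # Gradient of 0.5 * err**2 with respect to row j of W:
        # err * a_j * 1[pre_j > 0] * x / sqrt(m)
        grad = (err * a * (pre > 0))[:, None] * x[None, :] / np.sqrt(m)
        W -= lr * grad
    return W, W0, a

def predict(W, a, X):
    """Network output for a batch X of shape (n, d)."""
    return np.maximum(X @ W.T, 0.0) @ a / np.sqrt(W.shape[0])

def make_stream(n, d, seed=1):
    """Hypothetical data stream: unit-norm inputs with target y = x[0]."""
    rng = np.random.default_rng(seed)
    for _ in range(n):
        x = rng.standard_normal(d)
        x /= np.linalg.norm(x)
        yield x, x[0]
```

Calling `one_pass_sgd(make_stream(2000, 5), d=5)` and comparing `predict(W, a, X)` with `predict(W0, a, X)` on held-out points gives a rough picture of how the prediction error changes after a single pass over the stream.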

Mean-Field Analysis of Two-Layer Neural Networks: Non-Asymptotic Rates and Generalization Bounds

A mean-field analysis in a generalized neural tangent kernel regime is provided, and it is shown that noisy gradient descent with weight decay can still exhibit a "kernel-like" behavior, which implies that the training loss converges linearly up to a certain accuracy in such regime.

On the Global Convergence of Gradient Descent for Over-parameterized Models using Optimal Transport

It is shown that, when initialized correctly and in the many-particle limit, this gradient flow, although non-convex, converges to global minimizers and involves Wasserstein gradient flows, a by-product of optimal transport theory.

Gradient Descent Provably Optimizes Over-parameterized Neural Networks

Over-parameterization and random initialization jointly restrict every weight vector to be close to its initialization for all iterations, which allows a strong convexity-like property to show that gradient descent converges at a global linear rate to the global optimum.

An Improved Analysis of Training Over-parameterized Deep Neural Networks

An improved analysis of the global convergence of (stochastic) gradient descent for training deep neural networks is provided; it requires a milder over-parameterization condition than previous work in terms of the training sample size and other problem-dependent parameters.

Gradient descent optimizes over-parameterized deep ReLU networks

The key idea of the proof is that Gaussian random initialization followed by gradient descent produces a sequence of iterates that stay inside a small perturbation region centered at the initial weights, in which the training loss function of the deep ReLU networks enjoys nice local curvature properties that ensure the global convergence of gradient descent.

A Convergence Theory for Deep Learning via Over-Parameterization

This work proves why stochastic gradient descent can find global minima on the training objective of DNNs in polynomial time and implies an equivalence between over-parameterized neural networks and neural tangent kernel (NTK) in the finite (and polynomial) width setting.

On the Convergence Rate of Training Recurrent Neural Networks

It is shown that when the number of neurons is sufficiently large, meaning polynomial in the training data size and in the time horizon length, SGD is capable of minimizing the regression loss at a linear convergence rate, which gives theoretical evidence of how RNNs can memorize data.

Generalization Bounds of Stochastic Gradient Descent for Wide and Deep Neural Networks

The expected 0-1 loss of a wide enough ReLU network trained with stochastic gradient descent and random initialization can be bounded by the training loss of a random feature model induced by the network gradient at initialization, which is called a neural tangent random feature (NTRF) model.

On Lazy Training in Differentiable Programming

This work shows that this "lazy training" phenomenon is not specific to over-parameterized neural networks, and is due to a choice of scaling that makes the model behave as its linearization around the initialization, thus yielding a model equivalent to learning with positive-definite kernels.
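The phrase "behaves as its linearization around the initialization" can be made concrete with a small numerical sketch. It is not the construction used in the paper. It assumes a two-layer ReLU network with 1/sqrt(m) output scaling and a weight perturbation of fixed Frobenius norm, and it compares the network with its first-order Taylor expansion in the weights. The function names and the choice of width as the scaling knob are assumptions made for illustration.

```python
import numpy as np

def relu_net(W, a, X):
    """f(x) = (1/sqrt(m)) * sum_j a_j * relu(w_j . x) for each row x of X."""
    return np.maximum(X @ W.T, 0.0) @ a / np.sqrt(W.shape[0])

def linearized_net(W, W0, a, X):
    """First-order Taylor expansion of relu_net in W around W0:
    f(W0) + <grad_W f(W0), W - W0>."""
    pre0 = X @ W0.T                       # pre-activations at initialization
    act0 = (pre0 > 0).astype(float)       # ReLU derivative at initialization
    delta_pre = X @ (W - W0).T            # change in pre-activations
    return (np.maximum(pre0, 0.0) + act0 * delta_pre) @ a / np.sqrt(W0.shape[0])

def linearization_gap(m, d=10, radius=1.0, n_test=200, seed=0):
    """Mean absolute gap between the network and its linearization
    after moving the weights by a perturbation of Frobenius norm `radius`."""
    rng = np.random.default_rng(seed)
    W0 = rng.standard_normal((m, d))
    a = rng.choice([-1.0, 1.0], size=m)
    D = rng.standard_normal((m, d))
    D *= radius / np.linalg.norm(D)       # rescale to the requested norm
    X = rng.standard_normal((n_test, d))
    X /= np.linalg.norm(X, axis=1, keepdims=True)
    W = W0 + D
    return np.mean(np.abs(relu_net(W, a, X) - linearized_net(W, W0, a, X)))
```

Comparing `linearization_gap(100)` with `linearization_gap(10000)` shows the gap shrinking as the width grows while the size of the weight change stays fixed, which is one way to see the model approaching its linearization.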

Mean-field theory of two-layers neural networks: dimension-free bounds and kernel limit

This paper shows that the number of hidden units only needs to be larger than a quantity dependent on the regularity properties of the data, and independent of the dimensions, and generalizes this analysis to the case of unbounded activation functions.