Dynamics of stochastic gradient descent for two-layer neural networks in the teacher–student setup

@article{Goldt2019DynamicsOS,
  title={Dynamics of stochastic gradient descent for two-layer neural networks in the teacher–student setup},
  author={Sebastian Goldt and Madhu S. Advani and Andrew M. Saxe and Florent Krzakala and Lenka Zdeborov{\'a}},
  journal={Journal of Statistical Mechanics: Theory and Experiment},
  year={2020},
  volume={2020}
}
Abstract: Deep neural networks achieve stellar generalisation even when they have enough parameters to easily fit all their training data. We study this phenomenon by analysing the dynamics and the performance of over-parameterised two-layer neural networks in the teacher–student setup, where one network, the student, is trained on data generated by another network, called the teacher. We show how the dynamics of stochastic gradient descent (SGD) is captured by a set of differential equations…
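The setup described in the abstract can be reproduced directly: a fixed "teacher" two-layer network labels i.i.d. Gaussian inputs, and an over-parameterised "student" is trained on those labels by one-pass (online) SGD, each example being used exactly once. The sketch below is illustrative rather than the paper's exact protocol: the erf activation matches one of the cases studied, but the input dimension, hidden-layer sizes, learning rate and its scaling with N are assumptions chosen for readability.

import numpy as np
from scipy.special import erf

rng = np.random.default_rng(0)

N = 500         # input dimension (assumed value)
M = 2           # teacher hidden units
K = 4           # student hidden units; K > M makes the student over-parameterised
eta = 0.5       # learning rate (assumed value)
steps = 50_000  # number of online SGD steps; one fresh sample per step

def g(x):
    """Activation: g(x) = erf(x / sqrt(2))."""
    return erf(x / np.sqrt(2))

def dg(x):
    """Derivative of g."""
    return np.sqrt(2 / np.pi) * np.exp(-x ** 2 / 2)

# Teacher: fixed random first-layer weights, unit second-layer weights.
W_t = rng.normal(size=(M, N))
v_t = np.ones(M)

# Student: random initialisation of order one.
W_s = rng.normal(size=(K, N))
v_s = rng.normal(size=K) / np.sqrt(K)

def output(W, v, x):
    """Two-layer network: sum_k v_k * g(w_k . x / sqrt(N))."""
    return v @ g(W @ x / np.sqrt(N))

for _ in range(steps):
    x = rng.normal(size=N)                                 # fresh Gaussian input (one-pass SGD)
    delta = output(W_s, v_s, x) - output(W_t, v_t, x)      # prediction error
    lam = W_s @ x / np.sqrt(N)                             # student pre-activations
    # SGD on the quadratic loss 0.5 * delta**2; the 1/sqrt(N) and 1/N factors
    # follow the usual online-learning convention so that both layers evolve
    # on a timescale set by (number of examples)/N (an assumption of this sketch).
    W_s -= eta / np.sqrt(N) * delta * (v_s * dg(lam))[:, None] * x[None, :]
    v_s -= eta / N * delta * g(lam)

# Monte-Carlo estimate of the generalisation error on fresh inputs.
X_test = rng.normal(size=(5_000, N))
err = [0.5 * (output(W_s, v_s, x) - output(W_t, v_t, x)) ** 2 for x in X_test]
print("estimated generalisation error:", np.mean(err))

Tracking the student–teacher overlaps during such a run is what the differential equations mentioned in the abstract describe exactly in the limit of large input dimension N.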
Over-parameterization as a Catalyst for Better Generalization of Deep ReLU network
TLDR: It is proved that when the gradient is zero at every training data point, there exists a many-to-one alignment between student and teacher nodes in the lowest layer under mild conditions, suggesting that generalization to unseen data is achievable even though the same condition often leads to zero training error.
The Gaussian equivalence of generative models for learning with two-layer neural networks
TLDR: This work establishes rigorous conditions under which a class of generative models shares key statistical properties with an appropriately chosen Gaussian feature model, and uses this Gaussian equivalence theorem (GET) to derive a closed set of equations describing the dynamics of two-layer neural networks trained with one-pass stochastic gradient descent on data drawn from a large class of generators.
Soft Mode in the Dynamics of Over-realizable On-line Learning for Soft Committee Machines
TLDR: For on-line learning of a two-layer soft committee machine in the over-realizable case, this work finds that the approach to perfect learning occurs in a power-law fashion rather than exponentially, as in the realizable case.
Learning One-Hidden-Layer Neural Networks on Gaussian Mixture Models with Guaranteed Generalizability
We analyze the learning problem of fully connected neural networks with the sigmoid activation function for binary classification in the teacher-student setup, where the outputs are assumed to be…
Sharp Rate of Convergence for Deep Neural Network Classifiers under the Teacher-Student Setting
TLDR: A teacher-student framework is proposed in which the Bayes classifier is assumed to be expressible as a ReLU neural network, and a sharp rate of convergence, $\tilde{O}_d(n^{-2/3})$, is obtained for classifiers trained with either 0-1 loss or hinge loss.
Neural Mechanics: Symmetry and Broken Conservation Laws in Deep Learning Dynamics
Understanding the dynamics of neural network parameters during training is one of the key challenges in building a theoretical foundation for deep learning. A central obstacle is that the motion of a…
Representation mitosis in wide neural networks
TLDR: It is shown that a key ingredient to activate mitosis is continuing the training process until the training error is zero, and that in one of the learning tasks a wide model with several automatically developed clones performs significantly better than a deep ensemble based on architectures in which the last layer has the same size as the clones.
Analytic Study of Families of Spurious Minima in Two-Layer ReLU Neural Networks
TLDR: This work studies the optimization problem associated with fitting two-layer ReLU neural networks with respect to the squared loss, and develops a novel set of tools, drawn from equivariant bifurcation theory, for studying families of spurious minima.
Understanding Diversity based Pruning of Neural Networks - Statistical Mechanical Analysis
TLDR: This work sets up the problem in the statistical-mechanics formulation of a teacher-student framework, derives generalization error (GE) bounds for specific pruning methods to prove that a baseline random edge-pruning method performs better than DPP node pruning, and proposes a DPP edge-pruning technique for neural networks that empirically outperforms other competing pruning methods on real datasets.
If deep learning is the answer, what is the question?
TLDR: A road map is offered for how neuroscientists can use deep networks to model and understand biological brains, aimed at systems neuroscience research in the age of deep learning.

References

Showing 1–10 of 70 references
Learning and Generalization in Overparameterized Neural Networks, Going Beyond Two Layers
TLDR: It is proved that overparameterized neural networks trained with SGD (stochastic gradient descent) or its variants can learn, in polynomial time and using polynomially many samples, some notable concept classes, including two- and three-layer networks with fewer parameters and smooth activations.
High-dimensional dynamics of generalization error in neural networks
TLDR: It is found that the dynamics of gradient-descent learning naturally protect against overtraining and overfitting in large networks, and that standard applications of theories such as Rademacher complexity are inaccurate at predicting the generalization performance of deep neural networks.
Exact solutions to the nonlinear dynamics of learning in deep linear neural networks
TLDR: It is shown that deep linear networks exhibit nonlinear learning phenomena similar to those seen in simulations of nonlinear networks, including long plateaus followed by rapid transitions to lower-error solutions, and faster convergence from greedy unsupervised pretraining initial conditions than from random initial conditions.
Gradient Descent Provably Optimizes Over-parameterized Neural Networks
TLDR: Over-parameterization and random initialization jointly restrict every weight vector to remain close to its initialization for all iterations, which allows a strong convexity-like property to show that gradient descent converges at a global linear rate to the global optimum.
A Convergence Theory for Deep Learning via Over-Parameterization
TLDR: This work proves why stochastic gradient descent can find global minima of the training objective of DNNs in polynomial time, and implies an equivalence between over-parameterized neural networks and the neural tangent kernel (NTK) in the finite (and polynomial) width setting.
Gradient descent optimizes over-parameterized deep ReLU networks
TLDR: The key idea of the proof is that Gaussian random initialization followed by gradient descent produces a sequence of iterates that stay inside a small perturbation region centered at the initial weights, in which the training loss of the deep ReLU networks enjoys nice local curvature properties that ensure the global convergence of gradient descent.
On Lazy Training in Differentiable Programming
TLDR: This work shows that the "lazy training" phenomenon is not specific to over-parameterized neural networks, and is due to a choice of scaling that makes the model behave as its linearization around the initialization, yielding a model equivalent to learning with positive-definite kernels.
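The linearization picture behind the last two entries can be checked numerically. The sketch below is a toy illustration under assumed settings (width, data size, learning rate, an NTK-style 1/sqrt(m) output scaling), not the construction of either paper: it trains a wide two-layer ReLU network for a few gradient-descent steps and measures how closely the change in its outputs matches the first-order (linearized) prediction around the initialization.

import numpy as np

rng = np.random.default_rng(1)

n, d, m = 20, 5, 5000          # samples, input dimension, hidden width (assumed values)
lr, steps = 0.5, 200           # learning rate and number of full-batch GD steps

X = rng.normal(size=(n, d))
y = rng.normal(size=n)         # arbitrary regression targets

W0 = rng.normal(size=(m, d))   # first-layer weights at initialization
v0 = rng.normal(size=m)        # second-layer weights at initialization

def forward(W, v):
    """f(x) = (1/sqrt(m)) * sum_k v_k * relu(w_k . x), evaluated on all rows of X."""
    return np.maximum(X @ W.T, 0.0) @ v / np.sqrt(m)

def grads(W, v):
    """Gradients of the MSE loss 0.5/n * ||f - y||^2 with respect to W and v."""
    pre = X @ W.T                       # (n, m) pre-activations
    act = np.maximum(pre, 0.0)          # ReLU activations
    err = (forward(W, v) - y) / n       # (n,)
    gW = ((err[:, None] * (pre > 0) * v[None, :]).T @ X) / np.sqrt(m)
    gv = act.T @ err / np.sqrt(m)
    return gW, gv

# Jacobian of the outputs with respect to all parameters, at initialization.
pre0 = X @ W0.T
J_W = ((pre0 > 0) * v0[None, :])[:, :, None] * X[:, None, :] / np.sqrt(m)  # (n, m, d)
J_v = np.maximum(pre0, 0.0) / np.sqrt(m)                                   # (n, m)
J = np.concatenate([J_W.reshape(n, -1), J_v], axis=1)                      # (n, m*d + m)

W, v = W0.copy(), v0.copy()
for _ in range(steps):
    gW, gv = grads(W, v)
    W -= lr * gW
    v -= lr * gv

dtheta = np.concatenate([(W - W0).ravel(), v - v0])
actual = forward(W, v) - forward(W0, v0)   # true change in the network outputs
linear = J @ dtheta                        # prediction of the linearized model
print("relative deviation from the linearized model:",
      np.linalg.norm(actual - linear) / np.linalg.norm(actual))

For large widths the printed relative deviation is small, which is the sense in which the trained model behaves like a kernel method with the kernel fixed at initialization.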
Comparing Dynamics: Deep Neural Networks versus Glassy Systems
TLDR: The training dynamics of deep neural networks (DNNs) are analyzed numerically using methods developed in the statistical physics of glassy systems, suggesting that the dynamics slow down during training because of an increasingly large number of flat directions.
Learning by on-line gradient descent
We study on-line gradient-descent learning in multilayer networks analytically and numerically. The training is based on randomly drawn inputs and their corresponding outputs as defined by a target…
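This line of on-line learning analysis, which the main paper above extends, compresses the high-dimensional SGD dynamics into a handful of order parameters. As a sketch in the standard notation for a soft committee machine with activation g(x) = erf(x/√2) (student first-layer weights w_i, teacher weights \tilde{w}_n, input dimension N; the symbols follow the usual convention rather than this specific article):

\[
Q_{ik} \equiv \frac{w_i \cdot w_k}{N}, \qquad
R_{in} \equiv \frac{w_i \cdot \tilde{w}_n}{N}, \qquad
T_{nm} \equiv \frac{\tilde{w}_n \cdot \tilde{w}_m}{N},
\]
\[
\epsilon_g = \frac{1}{\pi} \left[
\sum_{i,k} \arcsin \frac{Q_{ik}}{\sqrt{(1+Q_{ii})(1+Q_{kk})}}
+ \sum_{n,m} \arcsin \frac{T_{nm}}{\sqrt{(1+T_{nn})(1+T_{mm})}}
- 2 \sum_{i,n} \arcsin \frac{R_{in}}{\sqrt{(1+Q_{ii})(1+T_{nn})}}
\right].
\]

In the limit N → ∞ at fixed ratio of examples to input dimension, the overlaps Q and R concentrate and obey closed ordinary differential equations; these are the differential equations referred to in the abstract at the top of this page.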
Computing Nonvacuous Generalization Bounds for Deep (Stochastic) Neural Networks with Many More Parameters than Training Data
TLDR: By optimizing the PAC-Bayes bound directly, the approach of Langford and Caruana (2001) is extended to obtain nonvacuous generalization bounds for deep stochastic neural network classifiers with millions of parameters trained on only tens of thousands of examples.