# Dynamics of stochastic gradient descent for two-layer neural networks in the teacher–student setup

@article{Goldt2019DynamicsOS,
  title   = {Dynamics of stochastic gradient descent for two-layer neural networks in the teacher--student setup},
  author  = {Sebastian Goldt and Madhu S. Advani and Andrew M. Saxe and F. Krzakala and L. Zdeborov{\'a}},
  journal = {Journal of Statistical Mechanics (Online)},
  year    = {2019},
  volume  = {2020}
}

#### Abstract

Deep neural networks achieve stellar generalisation even when they have enough parameters to easily fit all their training data. We study this phenomenon by analysing the dynamics and the performance of over-parameterised two-layer neural networks in the teacher–student setup, where one network, the student, is trained on data generated by another network, called the teacher. We show how the dynamics of stochastic gradient descent (SGD) is captured by a set of differential equations…
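The teacher–student setup described in the abstract can be sketched in code. The following is a minimal illustration under simplified assumptions, not the paper's implementation: a "student" soft committee machine with K hidden units is trained by one-pass SGD on inputs labelled by a fixed "teacher" with M units (all sizes, the erf activation, and the learning-rate scaling here are illustrative choices common in this literature).

```python
# Minimal teacher-student sketch (illustrative, not the paper's code):
# a student soft committee machine learns by one-pass SGD from data
# labelled by a fixed teacher network.
import math
import random

random.seed(0)

N, M, K = 100, 2, 3      # input dim, teacher hidden units, student hidden units
LR = 0.5 / N             # learning rate scaled as eta/N, as in online analyses
g = math.erf             # erf activation, standard in this literature

teacher = [[random.gauss(0, 1) for _ in range(N)] for _ in range(M)]
student = [[random.gauss(0, 1e-3) for _ in range(N)] for _ in range(K)]

def phi(w_rows, x):
    """Soft committee machine: sum_k erf(w_k . x / sqrt(N))."""
    return sum(g(sum(wi * xi for wi, xi in zip(w, x)) / math.sqrt(N))
               for w in w_rows)

def sgd_step(x):
    """One SGD update on the squared loss for a single fresh sample."""
    err = phi(student, x) - phi(teacher, x)
    for w in student:
        a = sum(wi * xi for wi, xi in zip(w, x)) / math.sqrt(N)
        dgdA = 2.0 / math.sqrt(math.pi) * math.exp(-a * a)  # erf'(a)
        for i in range(N):
            w[i] -= LR * err * dgdA * x[i] / math.sqrt(N)
    return 0.5 * err * err

losses = []
for step in range(2000):                  # one-pass: each sample used once
    x = [random.gauss(0, 1) for _ in range(N)]
    losses.append(sgd_step(x))

early = sum(losses[:200]) / 200
late = sum(losses[-200:]) / 200
print(early, late)                        # the loss should decrease over training
```

In the high-dimensional limit the weights themselves become irrelevant and the dynamics of a few order parameters (overlaps between student and teacher vectors) close, which is what the differential equations mentioned in the abstract track.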

#### 28 Citations

Over-parameterization as a Catalyst for Better Generalization of Deep ReLU network

- Mathematics, Computer Science
- ArXiv
- 2019

It is proved that when the gradient is zero at every training data point, a many-to-one alignment exists between student and teacher nodes in the lowest layer under mild conditions. This suggests that generalization to unseen data is achievable even though the same condition often leads to zero training error.

The Gaussian equivalence of generative models for learning with two-layer neural networks

- Computer Science
- ArXiv
- 2020

This work establishes rigorous conditions under which a class of generative models shares key statistical properties with an appropriately chosen Gaussian feature model, and uses this Gaussian equivalence theorem (GET) to derive a closed set of equations that describe the dynamics of two-layer neural networks trained using one-pass stochastic gradient descent on data drawn from a large class of generators.

Soft Mode in the Dynamics of Over-realizable On-line Learning for Soft Committee Machines

- Computer Science, Physics
- ArXiv
- 2021

For on-line learning of a two-layer soft committee machine in the over-realizable case, this work finds that the approach to perfect learning occurs in a power-law fashion rather than exponentially as in the realizable case.

Learning One-Hidden-Layer Neural Networks on Gaussian Mixture Models with Guaran-

- 2020

We analyze the learning problem of fully connected neural networks with the sigmoid activation function for binary classification in the teacher-student setup, where the outputs are assumed to be…

Sharp Rate of Convergence for Deep Neural Network Classifiers under the Teacher-Student Setting

- Mathematics, Computer Science
- 2020

A teacher-student framework is proposed that assumes the Bayes classifier can be expressed by ReLU neural networks, and a sharp rate of convergence, i.e. $\tilde{O}_d(n^{-2/3})$, is obtained for classifiers trained using either 0-1 loss or hinge loss.

Conservation Laws in Deep Learning Dynamics

- 2021

Understanding the dynamics of neural network parameters during training is one of the key challenges in building a theoretical foundation for deep learning. A central obstacle is that the motion of a…

Representation mitosis in wide neural networks

- Computer Science, Mathematics
- ArXiv
- 2021

It is shown that a key ingredient for activating mitosis is continuing the training process until the training error is zero. In one of the learning tasks, a wide model with several automatically developed clones performs significantly better than a deep ensemble based on architectures in which the last layer has the same size as the clones.

Analytic Study of Families of Spurious Minima in Two-Layer ReLU Neural Networks

- Computer Science, Mathematics
- ArXiv
- 2021

This work studies the optimization problem associated with fitting two-layer ReLU neural networks with respect to the squared loss, and develops a novel set of tools for studying families of spurious minima, drawing on equivariant bifurcation theory.

Understanding Diversity based Pruning of Neural Networks - Statistical Mechanical Analysis

- Computer Science
- ArXiv
- 2020

This work sets up the problem in the statistical-mechanics formulation of a teacher-student framework, derives generalization-error (GE) bounds for specific pruning methods, proves that the baseline random edge-pruning method performs better than the DPP node-pruning method, and proposes a DPP edge-pruning technique for neural networks that empirically outperforms competing pruning methods on real datasets.

If deep learning is the answer, what is the question?

- Medicine, Computer Science
- Nature reviews. Neuroscience
- 2020

A road map is offered for how neuroscientists can use deep networks to model and understand biological brains, guiding systems neuroscience research in the age of deep learning.

#### References

Showing 1–10 of 70 references

Learning and Generalization in Overparameterized Neural Networks, Going Beyond Two Layers

- Computer Science, Mathematics
- NeurIPS
- 2019

It is proved that overparameterized neural networks can learn some notable concept classes, including those computed by two- and three-layer networks with fewer parameters and smooth activations, via SGD (stochastic gradient descent) or its variants in polynomial time using polynomially many samples.

High-dimensional dynamics of generalization error in neural networks

- Computer Science, Mathematics
- Neural Networks
- 2020

It is found that the dynamics of gradient descent learning naturally protect against overtraining and overfitting in large networks, and that standard applications of theories such as Rademacher complexity are inaccurate in predicting the generalization performance of deep neural networks.

Exact solutions to the nonlinear dynamics of learning in deep linear neural networks

- Computer Science, Physics
- ICLR
- 2014

It is shown that deep linear networks exhibit nonlinear learning phenomena similar to those seen in simulations of nonlinear networks, including long plateaus followed by rapid transitions to lower error solutions, and faster convergence from greedy unsupervised pretraining initial conditions than from random initial conditions.
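The plateau-then-transition phenomenon described in this reference can be seen even in a scalar toy model. The following is a sketch under simplified assumptions, not the paper's setup: gradient descent on a two-layer linear "network" $f(x) = w_2 w_1 x$ fit to the target $f^*(x) = x$ from small initialization.

```python
# Toy illustration (not the paper's code): gradient descent on the product
# parameterisation f(x) = w2 * w1 * x with loss L = 0.5 * (w2 * w1 - 1)^2.
# Small balanced initialisation produces a long plateau followed by a rapid
# transition, mirroring the learning curves of deep linear networks.
w1 = w2 = 0.01        # small "balanced" initialisation
lr = 0.05
trace = []
for step in range(500):
    err = w2 * w1 - 1.0
    g1, g2 = err * w2, err * w1     # dL/dw1, dL/dw2
    w1, w2 = w1 - lr * g1, w2 - lr * g2
    trace.append(w2 * w1)

print(trace[49], trace[-1])         # little progress early, near 1.0 at the end
```

Because each gradient is proportional to the other weight, progress is exponentially slow while both weights are small, then accelerates sharply once they grow: exactly the plateau-and-transition shape.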

Gradient Descent Provably Optimizes Over-parameterized Neural Networks

- Computer Science, Mathematics
- ICLR
- 2019

Over-parameterization and random initialization jointly restrict every weight vector to stay close to its initialization for all iterations, which allows a strong convexity-like property to be used to show that gradient descent converges at a global linear rate to the global optimum.

A Convergence Theory for Deep Learning via Over-Parameterization

- Computer Science, Mathematics
- ICML
- 2019

This work proves why stochastic gradient descent can find global minima on the training objective of DNNs in *polynomial time*, and implies an equivalence between over-parameterized neural networks and the neural tangent kernel (NTK) in the finite (and polynomial) width setting.

Gradient descent optimizes over-parameterized deep ReLU networks

- Computer Science, Mathematics
- Machine Learning
- 2019

The key idea of the proof is that Gaussian random initialization followed by gradient descent produces a sequence of iterates that stay inside a small perturbation region centered at the initial weights, in which the training loss function of the deep ReLU networks enjoys nice local curvature properties that ensure the global convergence of gradient descent.

On Lazy Training in Differentiable Programming

- Computer Science
- NeurIPS
- 2019

This work shows that this "lazy training" phenomenon is not specific to over-parameterized neural networks, and is due to a choice of scaling that makes the model behave as its linearization around the initialization, thus yielding a model equivalent to learning with positive-definite kernels.
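The scaling effect this reference describes can be demonstrated on a two-parameter model. The following is an illustrative sketch, not the paper's code: a model $\alpha\,(h(w,x) - h(w_0,x))$ is trained with learning rate $\mathrm{lr}/\alpha^2$; for large $\alpha$ the weights barely move from their initialization while the fit still succeeds, because the model effectively behaves as its linearization around $w_0$.

```python
# Lazy-training sketch (illustrative, not the paper's code): as the output
# scale alpha grows, gradient descent fits the data while moving the weights
# only O(1/alpha) away from initialization.
import math

def h(w, x):
    # tiny nonlinear model: one tanh unit with an output weight
    return w[1] * math.tanh(w[0] * x)

def train(alpha, steps=500, lr=0.2):
    """Fit alpha * (h(w, x) - h(w0, x)) to one point by gradient descent."""
    w0 = [0.7, -0.4]
    w = list(w0)
    x, y = 1.0, 0.5
    for _ in range(steps):
        err = alpha * (h(w, x) - h(w0, x)) - y
        t = math.tanh(w[0] * x)
        # exact gradients of the squared loss, with the 1/alpha^2
        # learning-rate scaling used in lazy-training analyses
        g0 = err * alpha * w[1] * (1 - t * t) * x
        g1 = err * alpha * t
        w[0] -= (lr / alpha**2) * g0
        w[1] -= (lr / alpha**2) * g1
    disp = math.hypot(w[0] - w0[0], w[1] - w0[1])
    loss = (alpha * (h(w, x) - h(w0, x)) - y) ** 2
    return disp, loss

d1, l1 = train(alpha=1.0)
d100, l100 = train(alpha=100.0)
print(d1, d100)   # weight displacement shrinks when alpha is large
```

Subtracting $h(w_0, x)$ keeps the output at zero scale at initialization, and the $1/\alpha^2$ learning rate keeps the function-space dynamics comparable across $\alpha$, which is what makes the large-$\alpha$ regime equivalent to kernel (linearized) learning.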

Comparing Dynamics: Deep Neural Networks versus Glassy Systems

- Computer Science, Mathematics
- ICML
- 2018

The training dynamics of deep neural networks (DNNs) are analyzed numerically using methods developed in the statistical physics of glassy systems, suggesting that the dynamics slow down during training because of an increasingly large number of flat directions.

Learning by on-line gradient descent

- Mathematics
- 1995

We study on-line gradient-descent learning in multilayer networks analytically and numerically. The training is based on randomly drawn inputs and their corresponding outputs as defined by a target…

Computing Nonvacuous Generalization Bounds for Deep (Stochastic) Neural Networks with Many More Parameters than Training Data

- Computer Science, Mathematics
- UAI
- 2017

By optimizing the PAC-Bayes bound directly, the approach of Langford and Caruana (2001) is extended to obtain nonvacuous generalization bounds for deep stochastic neural network classifiers with millions of parameters trained on only tens of thousands of examples.