# Wide Neural Networks of Any Depth Evolve as Linear Models Under Gradient Descent

@article{Lee2019WideNN, title={Wide Neural Networks of Any Depth Evolve as Linear Models Under Gradient Descent}, author={Jaehoon Lee and Lechao Xiao and Samuel S. Schoenholz and Yasaman Bahri and Roman Novak and Jascha Sohl-Dickstein and Jeffrey Pennington}, journal={ArXiv}, year={2019}, volume={abs/1902.06720} }

A longstanding goal in deep learning research has been to precisely characterize training and generalization. [...] While these theoretical results are only exact in the infinite width limit, we nevertheless find excellent empirical agreement between the predictions of the original network and those of the linearized version even for finite practically-sized networks. This agreement is robust across different architectures, optimization methods, and loss functions.
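
The comparison between a network and its linearized version can be illustrated with a minimal sketch. This is not the authors' code: the one-hidden-layer architecture, the scalar input, the `tanh` activation, the `1/sqrt(n)` output scaling, and all function names are assumptions chosen for illustration. The linearized model is the first-order Taylor expansion of the network output in its parameters around their initial values.

```python
import math
import random


def network(x, w, v):
    """Toy one-hidden-layer network: f(x) = sum_j v_j * tanh(w_j * x) / sqrt(n)."""
    n = len(w)
    return sum(vj * math.tanh(wj * x) for wj, vj in zip(w, v)) / math.sqrt(n)


def gradient(x, w, v):
    """Gradient of the network output with respect to w and v."""
    scale = 1.0 / math.sqrt(len(w))
    dw = [scale * vj * (1.0 - math.tanh(wj * x) ** 2) * x for wj, vj in zip(w, v)]
    dv = [scale * math.tanh(wj * x) for wj in w]
    return dw, dv


def linearized(x, w, v, w0, v0):
    """First-order Taylor expansion of the network around (w0, v0)."""
    dw, dv = gradient(x, w0, v0)
    out = network(x, w0, v0)
    out += sum(g * (a - b) for g, a, b in zip(dw, w, w0))
    out += sum(g * (a - b) for g, a, b in zip(dv, v, v0))
    return out


rng = random.Random(0)
n = 512        # hidden width (illustrative)
eps = 1e-3     # size of the parameter perturbation (illustrative)
x = 0.7        # a single scalar test input

# Random initial parameters and a nearby perturbed copy.
w0 = [rng.gauss(0.0, 1.0) for _ in range(n)]
v0 = [rng.gauss(0.0, 1.0) for _ in range(n)]
w = [a + eps * rng.gauss(0.0, 1.0) for a in w0]
v = [a + eps * rng.gauss(0.0, 1.0) for a in v0]

# Discrepancy between the full network and its linearization.
gap = abs(network(x, w, v) - linearized(x, w, v, w0, v0))
print("network vs. linearized gap:", gap)
```

The sketch only checks that the two models agree for parameters close to initialization. It does not train anything and says nothing about how far parameters move during gradient descent at a given width.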

#### 406 Citations

What can linearized neural networks actually say about generalization?

- Computer Science
- ArXiv
- 2021

It is found that during training, deep networks increase the alignment of their empirical NTK with the target task, which explains why linear approximations at the end of training can better explain the dynamics of deep networks.

Benefits of Jointly Training Autoencoders: An Improved Neural Tangent Kernel Analysis

- Computer Science, Mathematics
- IEEE Transactions on Information Theory
- 2021

This paper rigorously proves the linear convergence of gradient descent in two weakly-trained and jointly-trained regimes and indicates the considerable benefits of joint training over weak training in finding global optima, achieving a dramatic decrease in the required level of over-parameterization.

Asymptotics of Wide Convolutional Neural Networks

- Computer Science, Physics
- ArXiv
- 2020

It is found that the difference in performance between finite and infinite width models vanishes at a definite rate with respect to model width, consistent with finite width models generalizing either better or worse than their infinite width counterparts.

Learning Curves for Deep Neural Networks: A field theory perspective

- Computer Science
- 2019

A renormalization-group approach is used to show that noiseless GP inference using NTK, which lacks a good analytical handle, can be well approximated by noisy GP inference on a related kernel the authors call the renormalized NTK.

Deep learning versus kernel learning: an empirical study of loss landscape geometry and the time evolution of the Neural Tangent Kernel

- Computer Science, Mathematics
- NeurIPS
- 2020

A large-scale phenomenological analysis of training reveals a striking correlation between a diverse set of metrics over training time, governed by a rapid chaotic to stable transition in the first few epochs, that together poses challenges and opportunities for the development of more accurate theories of deep learning.

Disentangling trainability and generalization in deep learning

- Computer Science, Mathematics
- ArXiv
- 2019

This paper discusses challenging issues in the context of wide neural networks at large depths and finds that there are large regions of hyperparameter space where networks can only memorize the training set, in the sense that they reach perfect training accuracy but completely fail to generalize outside the training set.

On the Optimization Dynamics of Wide Hypernetworks

- Computer Science, Mathematics
- ArXiv
- 2020

This work partially solves an open problem and shows that the convergence rate of the r order term of the Taylor expansion of the cost function, along the optimization trajectories of SGD is n, improving upon the bound suggested by the conjecture of Dyer & Gur-Ari, while matching their empirical observations.

An analytic theory of shallow networks dynamics for hinge loss classification

- Computer Science, Mathematics
- NeurIPS
- 2020

This paper studies in detail the training dynamics of a simple type of neural network, a single hidden layer trained to perform a classification task, and shows that in a suitable mean-field limit this case maps to a single-node learning problem with a time-dependent dataset determined self-consistently from the average nodes population.

Training Dynamics of Deep Networks using Stochastic Gradient Descent via Neural Tangent Kernel

- Computer Science, Mathematics
- ArXiv
- 2019

This work provides a comprehensive analysis on the impact of the initialization and the activation function on the NTK, and thus on the corresponding training dynamics under SGD, and provides experiments illustrating the theoretical results.

Analyzing Monotonic Linear Interpolation in Neural Network Loss Landscapes

- Computer Science, Mathematics
- ArXiv
- 2021

It is shown in practice that networks violating the MLI property can be produced systematically, by encouraging the weights to move far from initialization, and the need to further study their global properties is highlighted.

#### References

SHOWING 1-10 OF 59 REFERENCES

A Convergence Theory for Deep Learning via Over-Parameterization

- Computer Science, Mathematics
- ICML
- 2019

This work proves why stochastic gradient descent can find global minima on the training objective of DNNs in polynomial time and implies an equivalence between over-parameterized neural networks and the neural tangent kernel (NTK) in the finite (and polynomial) width setting.

Scaling Limits of Wide Neural Networks with Weight Sharing: Gaussian Process Behavior, Gradient Independence, and Neural Tangent Kernel Derivation

- Computer Science, Physics
- ArXiv
- 2019

This work opens a way toward design of even stronger Gaussian Processes, initialization schemes to avoid gradient explosion/vanishing, and deeper understanding of SGD dynamics in modern architectures.

Exact solutions to the nonlinear dynamics of learning in deep linear neural networks

- Computer Science, Physics
- ICLR
- 2014

It is shown that deep linear networks exhibit nonlinear learning phenomena similar to those seen in simulations of nonlinear networks, including long plateaus followed by rapid transitions to lower error solutions, and faster convergence from greedy unsupervised pretraining initial conditions than from random initial conditions.

Gradient descent optimizes over-parameterized deep ReLU networks

- Computer Science, Mathematics
- Machine Learning
- 2019

The key idea of the proof is that Gaussian random initialization followed by gradient descent produces a sequence of iterates that stay inside a small perturbation region centered at the initial weights, in which the training loss function of the deep ReLU networks enjoys nice local curvature properties that ensure the global convergence of gradient descent.

Deep Neural Networks as Gaussian Processes

- Computer Science, Mathematics
- ICLR
- 2018

The exact equivalence between infinitely wide deep networks and GPs is derived and it is found that test performance increases as finite-width trained networks are made wider and more similar to a GP, and thus that GP predictions typically outperform those of finite-width networks.

On Lazy Training in Differentiable Programming

- Computer Science
- NeurIPS
- 2019

This work shows that this "lazy training" phenomenon is not specific to over-parameterized neural networks, and is due to a choice of scaling that makes the model behave as its linearization around the initialization, thus yielding a model equivalent to learning with positive-definite kernels.

A Mean Field Theory of Batch Normalization

- Computer Science, Physics
- ICLR
- 2019

The theory shows that gradient signals grow exponentially in depth and that these exploding gradients cannot be eliminated by tuning the initial weight variances or by adjusting the nonlinear activation function, so vanilla batch-normalized networks without skip connections are not trainable at large depths for common initialization schemes.

The Effect of Network Width on Stochastic Gradient Descent and Generalization: an Empirical Study

- Computer Science, Mathematics
- ICML
- 2019

It is found that the optimal SGD hyper-parameters are determined by a "normalized noise scale," which is a function of the batch size, learning rate, and initialization conditions, and in the absence of batch normalization, the optimal normalized noise scale is directly proportional to width.

Gaussian Process Behaviour in Wide Deep Neural Networks

- Computer Science, Mathematics
- ICLR
- 2018

It is shown that, under broad conditions, as the authors make the architecture increasingly wide, the implied random function converges in distribution to a Gaussian process, formalising and extending existing results by Neal (1996) to deep networks.

Sensitivity and Generalization in Neural Networks: an Empirical Study

- Computer Science, Mathematics
- ICLR
- 2018

It is found that trained neural networks are more robust to input perturbations in the vicinity of the training data manifold, as measured by the norm of the input-output Jacobian of the network, and that it correlates well with generalization.