# Wide Neural Networks of Any Depth Evolve as Linear Models Under Gradient Descent

@article{Lee2019WideNN,
title={Wide Neural Networks of Any Depth Evolve as Linear Models Under Gradient Descent},
author={Jaehoon Lee and Lechao Xiao and Samuel S. Schoenholz and Yasaman Bahri and Roman Novak and Jascha Sohl-Dickstein and Jeffrey Pennington},
journal={ArXiv},
year={2019},
volume={abs/1902.06720}
}
A longstanding goal in deep learning research has been to precisely characterize training and generalization. [...] While these theoretical results are only exact in the infinite width limit, we nevertheless find excellent empirical agreement between the predictions of the original network and those of the linearized version even for finite practically-sized networks. This agreement is robust across different architectures, optimization methods, and loss functions.
406 Citations
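
The "linearized version" mentioned in the abstract is the first-order Taylor expansion of the network output in its parameters around their initial values. The following is a minimal sketch of that idea, not the authors' experimental code: the one-hidden-layer tanh network, the 1/sqrt(n) output scaling, and the names `f`, `grad_f`, `f_lin`, and `linearization_gap` are all assumptions chosen for illustration. It measures how far the network output drifts from its linearization when the parameters move a given distance from initialization.

```python
import math
import random


def f(params, x):
    """Scalar output of a toy one-hidden-layer tanh network (illustrative only)."""
    w, v = params
    n = len(w)
    return sum(vi * math.tanh(wi * x) for wi, vi in zip(w, v)) / math.sqrt(n)


def grad_f(params, x):
    """Analytic gradient of f with respect to the hidden weights w and output weights v."""
    w, v = params
    scale = 1.0 / math.sqrt(len(w))
    dw = [scale * vi * x * (1.0 - math.tanh(wi * x) ** 2) for wi, vi in zip(w, v)]
    dv = [scale * math.tanh(wi * x) for wi in w]
    return dw, dv


def f_lin(params, params0, x):
    """First-order Taylor expansion of f around the initial parameters params0."""
    w, v = params
    w0, v0 = params0
    dw, dv = grad_f(params0, x)
    delta = sum(g * (a - b) for g, a, b in zip(dw, w, w0))
    delta += sum(g * (a - b) for g, a, b in zip(dv, v, v0))
    return f(params0, x) + delta


def linearization_gap(step, n=64, x=0.7, seed=0):
    """|f - f_lin| after moving the parameters a distance proportional to `step`."""
    rng = random.Random(seed)
    w0 = [rng.gauss(0.0, 1.0) for _ in range(n)]
    v0 = [rng.gauss(0.0, 1.0) for _ in range(n)]
    # Fixed random direction, so only the step size changes between calls.
    dir_w = [rng.gauss(0.0, 1.0) for _ in range(n)]
    dir_v = [rng.gauss(0.0, 1.0) for _ in range(n)]
    w = [a + step * d for a, d in zip(w0, dir_w)]
    v = [a + step * d for a, d in zip(v0, dir_v)]
    return abs(f((w, v), x) - f_lin((w, v), (w0, v0), x))


if __name__ == "__main__":
    for step in (0.0, 0.01, 0.1):
        print(step, linearization_gap(step))
```

The gap is zero at initialization and grows as the parameters move away from it, which is why the linear approximation is accurate when parameters change little during training.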

#### Paper Mentions

What can linearized neural networks actually say about generalization?
• Computer Science
• ArXiv
• 2021
It is found that during training, deep networks increase the alignment of their empirical NTK with the target task, which explains why linear approximations at the end of training can better explain the dynamics of deep networks.
Benefits of Jointly Training Autoencoders: An Improved Neural Tangent Kernel Analysis
• Computer Science, Mathematics
• IEEE Transactions on Information Theory
• 2021
This paper rigorously proves the linear convergence of gradient descent in two weakly-trained and jointly-trained regimes and indicates the considerable benefits of joint training over weak training in finding global optima, achieving a dramatic decrease in the required level of over-parameterization.
Asymptotics of Wide Convolutional Neural Networks
• Computer Science, Physics
• ArXiv
• 2020
It is found that the difference in performance between finite and infinite width models vanishes at a definite rate with respect to model width, consistent with finite width models generalizing either better or worse than their infinite width counterparts.
Learning Curves for Deep Neural Networks: A field theory perspective
• Computer Science
• 2019
A renormalization-group approach is used to show that noiseless GP inference using NTK, which lacks a good analytical handle, can be well approximated by noisy GP inference on a related kernel the authors call the renormalized NTK.
Deep learning versus kernel learning: an empirical study of loss landscape geometry and the time evolution of the Neural Tangent Kernel
• Computer Science, Mathematics
• NeurIPS
• 2020
A large-scale phenomenological analysis of training reveals a striking correlation between a diverse set of metrics over training time, governed by a rapid chaotic to stable transition in the first few epochs, that together poses challenges and opportunities for the development of more accurate theories of deep learning.
Disentangling trainability and generalization in deep learning
• Computer Science, Mathematics
• ArXiv
• 2019
This paper discusses challenging issues in the context of wide neural networks at large depths and finds that there are large regions of hyperparameter space where networks can only memorize the training set, in the sense that they reach perfect training accuracy but completely fail to generalize outside the training set.
On the Optimization Dynamics of Wide Hypernetworks
• Computer Science, Mathematics
• ArXiv
• 2020
This work partially solves an open problem and shows that the convergence rate of the r order term of the Taylor expansion of the cost function, along the optimization trajectories of SGD is n, improving upon the bound suggested by the conjecture of Dyer & Gur-Ari, while matching their empirical observations.
An analytic theory of shallow networks dynamics for hinge loss classification
• Computer Science, Mathematics
• NeurIPS
• 2020
This paper studies in detail the training dynamics of a simple type of neural network, a single hidden layer trained to perform a classification task, and shows that in a suitable mean-field limit this case maps to a single-node learning problem with a time-dependent dataset determined self-consistently from the average nodes population.
Training Dynamics of Deep Networks using Stochastic Gradient Descent via Neural Tangent Kernel
• Computer Science, Mathematics
• ArXiv
• 2019
This work provides a comprehensive analysis on the impact of the initialization and the activation function on the NTK, and thus on the corresponding training dynamics under SGD, and provides experiments illustrating the theoretical results.
Analyzing Monotonic Linear Interpolation in Neural Network Loss Landscapes
• Computer Science, Mathematics
• ArXiv
• 2021
It is shown in practice that networks violating the MLI property can be produced systematically, by encouraging the weights to move far from initialization, and the need to further study their global properties is highlighted.

#### References

A Convergence Theory for Deep Learning via Over-Parameterization
• Computer Science, Mathematics
• ICML
• 2019
This work proves why stochastic gradient descent can find global minima on the training objective of DNNs in polynomial time and implies an equivalence between over-parameterized neural networks and neural tangent kernel (NTK) in the finite (and polynomial) width setting.
Scaling Limits of Wide Neural Networks with Weight Sharing: Gaussian Process Behavior, Gradient Independence, and Neural Tangent Kernel Derivation
This work opens a way toward design of even stronger Gaussian Processes, initialization schemes to avoid gradient explosion/vanishing, and deeper understanding of SGD dynamics in modern architectures.
Exact solutions to the nonlinear dynamics of learning in deep linear neural networks
• Computer Science, Physics
• ICLR
• 2014
It is shown that deep linear networks exhibit nonlinear learning phenomena similar to those seen in simulations of nonlinear networks, including long plateaus followed by rapid transitions to lower error solutions, and faster convergence from greedy unsupervised pretraining initial conditions than from random initial conditions.
Gradient descent optimizes over-parameterized deep ReLU networks
• Computer Science, Mathematics
• Machine Learning
• 2019
The key idea of the proof is that Gaussian random initialization followed by gradient descent produces a sequence of iterates that stay inside a small perturbation region centered at the initial weights, in which the training loss function of the deep ReLU networks enjoys nice local curvature properties that ensure the global convergence of gradient descent.
Deep Neural Networks as Gaussian Processes
• Computer Science, Mathematics
• ICLR
• 2018
The exact equivalence between infinitely wide deep networks and GPs is derived and it is found that test performance increases as finite-width trained networks are made wider and more similar to a GP, and thus that GP predictions typically outperform those of finite-width networks.
On Lazy Training in Differentiable Programming
• Computer Science
• NeurIPS
• 2019
This work shows that this "lazy training" phenomenon is not specific to over-parameterized neural networks, and is due to a choice of scaling that makes the model behave as its linearization around the initialization, thus yielding a model equivalent to learning with positive-definite kernels.
A Mean Field Theory of Batch Normalization
• Computer Science, Physics
• ICLR
• 2019
The theory shows that gradient signals grow exponentially in depth and that these exploding gradients cannot be eliminated by tuning the initial weight variances or by adjusting the nonlinear activation function, so vanilla batch-normalized networks without skip connections are not trainable at large depths for common initialization schemes.
The Effect of Network Width on Stochastic Gradient Descent and Generalization: an Empirical Study
• Computer Science, Mathematics
• ICML
• 2019
It is found that the optimal SGD hyper-parameters are determined by a "normalized noise scale," which is a function of the batch size, learning rate, and initialization conditions, and in the absence of batch normalization, the optimal normalized noise scale is directly proportional to width.
Gaussian Process Behaviour in Wide Deep Neural Networks
• Computer Science, Mathematics
• ICLR
• 2018
It is shown that, under broad conditions, as the authors make the architecture increasingly wide, the implied random function converges in distribution to a Gaussian process, formalising and extending existing results by Neal (1996) to deep networks.
Sensitivity and Generalization in Neural Networks: an Empirical Study
• Computer Science, Mathematics
• ICLR
• 2018
It is found that trained neural networks are more robust to input perturbations in the vicinity of the training data manifold, as measured by the norm of the input-output Jacobian of the network, and that it correlates well with generalization.