High-dimensional dynamics of generalization error in neural networks

@article{Advani2020HighdimensionalDO,
  title={High-dimensional dynamics of generalization error in neural networks},
  author={Madhu S. Advani and Andrew M. Saxe},
  journal={Neural Networks},
  year={2020},
  volume={132},
  pages={428--446}
}


An analytic theory of shallow networks dynamics for hinge loss classification
TLDR
This paper studies in detail the training dynamics of a simple type of neural network, a single hidden layer trained to perform a classification task, and shows that in a suitable mean-field limit this case maps to a single-node learning problem with a time-dependent dataset determined self-consistently from the average node population.
Generalization of Two-layer Neural Networks: An Asymptotic Viewpoint
TLDR
The exact population risk of the unregularized least squares regression problem with two-layer neural networks when either the first or the second layer is trained using a gradient flow under different initialization setups is derived.
Multi-scale Feature Learning Dynamics: Insights for Double Descent
TLDR
This work investigates the origins of the less studied epoch-wise double descent in which the test error undergoes two non-monotonous transitions, or descents as the training time increases, and derives closed-form analytical expressions for the evolution of generalization error over training.
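The summary above concerns closed-form expressions for how test error evolves over training time. As a hedged illustration of the quantity being tracked (not taken from the cited paper, and not reproducing its closed forms), the sketch below runs full-batch gradient descent on a noisy teacher–student linear regression and prints the test error as training proceeds; all sizes, the noise level, and the learning rate are assumptions made for illustration.

```python
# Minimal sketch: test error versus training time in a noisy teacher-student
# linear regression. Dimensions, noise level, and learning rate are assumed.
import numpy as np

rng = np.random.default_rng(1)
d, n, n_test, sigma = 200, 100, 2000, 0.5    # overparameterized regime: n < d

w_star = rng.normal(size=d) / np.sqrt(d)     # teacher weights
X = rng.normal(size=(n, d))
y = X @ w_star + sigma * rng.normal(size=n)  # noisy training labels
X_te = rng.normal(size=(n_test, d))
y_te = X_te @ w_star                         # noiseless test labels

w = np.zeros(d)                              # small (here zero) initialization
lr, steps = 5e-3, 2001
for t in range(steps):
    w -= lr * X.T @ (X @ w - y) / n          # full-batch gradient descent on MSE
    if t % 200 == 0:
        print(f"step {t:5d}  test MSE {np.mean((X_te @ w - y_te) ** 2):.4f}")
```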
A Dynamical View on Optimization Algorithms of Overparameterized Neural Networks
TLDR
It is shown from a dynamical-system perspective that the Heavy Ball method can converge to the global minimum of the mean squared error (MSE) at a linear rate (similar to GD), whereas Nesterov accelerated gradient descent (NAG) only converges to the global minimum sublinearly.
Asymptotic normality and confidence intervals for derivatives of 2-layers neural network in the random features model
TLDR
Asymptotic distribution results are established for this two-layer NN model, and the double-descent phenomenon occurs in terms of the length of the CIs, with the length increasing and then decreasing as d/n ↗ +∞ for certain fixed values of p/n.
A Random Matrix Perspective on Mixtures of Nonlinearities in High Dimensions
TLDR
This work analyzes the performance of random feature regression with features F = f ( WX + B ) for a random weight matrix W and bias vector B, obtaining exact formulae for the asymptotic training and test errors for data generated by a linear teacher model.
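A minimal sketch (not from the cited paper) of the random feature regression setup described above: features F = f(WX + B) with random weights W and biases B, data generated by a linear teacher, and ridgeless least squares on the features. The dimensions, the ReLU choice of f, and the weight scalings are all assumptions for illustration.

```python
# Random feature regression with a linear teacher: F = f(W X + B), then
# least squares on F. All sizes and scalings below are assumed.
import numpy as np

rng = np.random.default_rng(0)
d, p, n, n_test = 100, 400, 300, 1000       # input dim, #features, train/test sizes

beta = rng.normal(size=d) / np.sqrt(d)      # linear teacher
W = rng.normal(size=(p, d)) / np.sqrt(d)    # random first-layer weights
B = rng.normal(size=(p, 1))                 # random biases

def features(X):
    # F = f(W X + B) with f = ReLU (one possible choice of nonlinearity)
    return np.maximum(W @ X + B, 0.0)

X, X_test = rng.normal(size=(d, n)), rng.normal(size=(d, n_test))
y, y_test = beta @ X, beta @ X_test         # noiseless labels from the linear teacher

F, F_test = features(X), features(X_test)
a = np.linalg.lstsq(F.T, y, rcond=None)[0]  # ridgeless least squares on the features

print(f"train MSE {np.mean((a @ F - y) ** 2):.4f}")
print(f"test  MSE {np.mean((a @ F_test - y_test) ** 2):.4f}")
```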
Generalization Error of Generalized Linear Models in High Dimensions
TLDR
This work provides a general framework to characterize the asymptotic generalization error for single-layer neural networks (i.e., generalized linear models) with arbitrary non-linearities, making it applicable to regression as well as classification problems.
Implicit Regularization of Discrete Gradient Dynamics in Deep Linear Neural Networks
TLDR
This work studies the discrete gradient dynamics of the training of a two-layer linear network with the least-squares loss using a time rescaling to show that this dynamics sequentially learns the solutions of a reduced-rank regression with a gradually increasing rank.
A Random Matrix Perspective on Mixtures of Nonlinearities for Deep Learning
TLDR
Intriguingly, it is found that a mixture of nonlinearities can outperform the best single nonlinearity on the noisy autoencoding task, suggesting that mixtures of nonlinearities might be useful for approximate kernel methods or neural network architecture design.
Understanding Generalization in Recurrent Neural Networks
TLDR
This work proposes adding random noise to the input data and proves a generalization bound for training with random noise that extends the earlier bound; it also finds that the Fisher-Rao norm for RNNs can be interpreted as a measure of gradient, and that incorporating this gradient measure can tighten the bound.
...
...

References

SHOWING 1-10 OF 76 REFERENCES
Generalization of Two-layer Neural Networks: An Asymptotic Viewpoint
TLDR
The exact population risk of the unregularized least squares regression problem with two-layer neural networks when either the first or the second layer is trained using a gradient flow under different initialization setups is derived.
Exact solutions to the nonlinear dynamics of learning in deep linear neural networks
TLDR
It is shown that deep linear networks exhibit nonlinear learning phenomena similar to those seen in simulations of nonlinear networks, including long plateaus followed by rapid transitions to lower error solutions, and faster convergence from greedy unsupervised pretraining initial conditions than from random initial conditions.
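As a hedged sketch of the kind of closed-form trajectory derived there (reconstructed for the decoupled, whitened-input case; the paper's exact statement should be consulted), each input-output mode with singular value $s_\alpha$ grows sigmoidally from its initial strength $u_\alpha(0)$,

$$u_\alpha(t) \;=\; \frac{s_\alpha\, e^{2 s_\alpha t/\tau}}{e^{2 s_\alpha t/\tau} - 1 + s_\alpha / u_\alpha(0)} \;\xrightarrow[t\to\infty]{}\; s_\alpha,$$

where $\tau$ is the learning-rate time constant; a small initial strength $u_\alpha(0)$ is what produces the long plateau followed by a rapid transition.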
Wide Neural Networks of Any Depth Evolve as Linear Models Under Gradient Descent
TLDR
This work shows that for wide neural networks the learning dynamics simplify considerably and that, in the infinite width limit, they are governed by a linear model obtained from the first-order Taylor expansion of the network around its initial parameters.
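The linear model referred to in that summary is the first-order Taylor expansion of the network output in its parameters around initialization; written out in standard notation (not quoted verbatim from the paper),

$$f_{\mathrm{lin}}(x;\theta) \;=\; f(x;\theta_0) + \nabla_\theta f(x;\theta_0)^{\top}(\theta - \theta_0),$$

and the claim is that, in the infinite-width limit, gradient descent on $f$ follows the same trajectory as gradient descent on $f_{\mathrm{lin}}$, i.e. kernel regression with the tangent kernel $\nabla_\theta f(x;\theta_0)^{\top}\nabla_\theta f(x';\theta_0)$.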
Temporal Evolution of Generalization during Learning in Linear Networks
TLDR
It is shown that the behavior of the validation function depends critically on the initial conditions and on the characteristics of the noise, and that under certain simple assumptions, if the initial weights are sufficiently small, the validation function has a unique minimum corresponding to an optimal stopping time for training.
Statistical mechanics of learning from examples.
TLDR
It is shown that for smooth networks, i.e., those with continuously varying weights and smooth transfer functions, the generalization curve asymptotically obeys an inverse power law, while for nonsmooth networks other behaviors can appear, depending on the nature of the nonlinearities as well as the realizability of the rule.
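As a hedged restatement of that asymptotic claim, in the standard statistical-mechanics notation where $\alpha$ is the number of examples per weight and the constant $c$ is problem-dependent (and not specified here),

$$\epsilon_g(\alpha) \;\sim\; \frac{c}{\alpha} \qquad \text{as } \alpha \to \infty \quad \text{(smooth networks)},$$

while, as the summary notes, other behaviors (e.g. different exponents or discontinuous transitions) can appear for non-smooth networks.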
Generalization Dynamics in LMS Trained Linear Networks
TLDR
For a speech labeling task, predicted weaving effects were qualitatively tested and observed by computer simulations in networks trained by the linear and non-linear back-propagation algorithm.
On Lazy Training in Differentiable Programming
TLDR
This work shows that this "lazy training" phenomenon is not specific to over-parameterized neural networks, and is due to a choice of scaling that makes the model behave as its linearization around the initialization, thus yielding a model equivalent to learning with positive-definite kernels.
Exponentially vanishing sub-optimal local minima in multilayer neural networks
TLDR
It is proved that, with high probability in the limit of $N\rightarrow\infty$ datapoints, the volume of differentiable regions of the empirical loss containing sub-optimal differentiable local minima is exponentially vanishing in comparison with the same volume of global minima.
Understanding deep learning requires rethinking generalization
TLDR
These experiments establish that state-of-the-art convolutional networks for image classification trained with stochastic gradient methods easily fit a random labeling of the training data, and confirm that simple depth two neural networks already have perfect finite sample expressivity.
A mean field view of the landscape of two-layer neural networks
TLDR
A compact description of the SGD dynamics is derived in terms of a limiting partial differential equation that allows for “averaging out” some of the complexities of the landscape of neural networks and can be used to prove a general convergence result for noisy SGD.
...
...