High-dimensional dynamics of generalization error in neural networks

@article{Advani2020HighdimensionalDO,
  title={High-dimensional dynamics of generalization error in neural networks},
  author={Madhu S. Advani and Andrew M. Saxe},
  journal={Neural Networks},
  year={2020},
  volume={132},
  pages={428--446}
}


An analytic theory of shallow networks dynamics for hinge loss classification
TLDR
This paper studies in detail the training dynamics of a simple type of neural network: a single hidden layer trained to perform a classification task, and shows that in a suitable mean-field limit this case maps to a single-node learning problem with a time-dependent dataset determined self-consistently from the average node population.
Scaling description of generalization with number of parameters in deep learning
TLDR
This work relies on the so-called Neural Tangent Kernel, which connects large neural nets to kernel methods, to show that the initialization causes finite-size random fluctuations of the neural net output function f_N around its expectation, which affects the generalization error for classification.
Generalization of Two-layer Neural Networks: An Asymptotic Viewpoint
TLDR
This work derives the exact population risk of unregularized least-squares regression with two-layer neural networks, when either the first or the second layer is trained using a gradient flow, under different initialization setups.
Multi-scale Feature Learning Dynamics: Insights for Double Descent
TLDR
This work investigates the origins of the less-studied epoch-wise double descent, in which the test error undergoes two non-monotonic transitions, or descents, as the training time increases, and derives closed-form analytical expressions for the evolution of the generalization error over training.
A Dynamical View on Optimization Algorithms of Overparameterized Neural Networks
TLDR
It is shown from a dynamical systems perspective that the Heavy Ball method can converge to a global minimum of the mean squared error (MSE) at a linear rate (similar to GD); however, Nesterov accelerated gradient (NAG) only converges to a global minimum sublinearly.
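To make the update rule concrete, here is a minimal NumPy sketch (not taken from the paper) of heavy-ball (Polyak momentum) gradient descent on a least-squares MSE objective; the step size, momentum coefficient, and synthetic data are illustrative assumptions.

import numpy as np

def heavy_ball_mse(X, y, lr=0.01, momentum=0.9, steps=500):
    # Heavy-ball (Polyak momentum) gradient descent on L(w) = 1/(2n) * ||Xw - y||^2.
    n, d = X.shape
    w = np.zeros(d)
    v = np.zeros(d)
    for _ in range(steps):
        grad = X.T @ (X @ w - y) / n   # gradient of the MSE loss
        v = momentum * v - lr * grad   # momentum ("heavy ball") buffer
        w = w + v
    return w

# Illustrative usage on a synthetic least-squares problem (sizes and noise level are assumptions).
rng = np.random.default_rng(0)
X = rng.standard_normal((200, 20))
w_star = rng.standard_normal(20)
y = X @ w_star + 0.1 * rng.standard_normal(200)
w_hat = heavy_ball_mse(X, y)
print(np.linalg.norm(w_hat - w_star))   # small: the iterate approaches the least-squares minimizer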
Asymptotic normality and confidence intervals for derivatives of 2-layers neural network in the random features model
TLDR
Asymptotic distribution results for this 2-layer NN model are established, and the double-descent phenomenon occurs in terms of the length of the confidence intervals (CIs), with the length increasing and then decreasing as d/n ↗ +∞ for certain fixed values of p/n.
A Random Matrix Perspective on Mixtures of Nonlinearities in High Dimensions
TLDR
This work analyzes the performance of random feature regression with features F = f(WX + B) for a random weight matrix W and bias vector B, obtaining exact formulae for the asymptotic training and test errors for data generated by a linear teacher model.
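As a concrete illustration of this setting (a sketch, not the paper's own code), the following NumPy snippet fits ridge regression on random features f(WX + B) to data from a linear teacher; the ReLU nonlinearity, feature count, and ridge strength are assumptions made for the example.

import numpy as np

def random_feature_regression(X_train, y_train, X_test, n_features=512, ridge=1e-3, seed=0):
    # Ridge regression on random features F = f(WX + B), with f = ReLU as an illustrative choice.
    rng = np.random.default_rng(seed)
    d = X_train.shape[1]
    W = rng.standard_normal((n_features, d)) / np.sqrt(d)   # random, untrained first-layer weights
    b = rng.standard_normal(n_features)                     # random bias vector
    feats = lambda X: np.maximum(X @ W.T + b, 0.0)
    F_train, F_test = feats(X_train), feats(X_test)
    # Closed-form ridge solution for the trained second-layer weights.
    a = np.linalg.solve(F_train.T @ F_train + ridge * np.eye(n_features), F_train.T @ y_train)
    return F_test @ a

# Data generated by a linear teacher, matching the setting described above (sizes are assumptions).
rng = np.random.default_rng(1)
d, n = 50, 400
beta = rng.standard_normal(d) / np.sqrt(d)
X_tr, X_te = rng.standard_normal((n, d)), rng.standard_normal((1000, d))
y_tr = X_tr @ beta + 0.1 * rng.standard_normal(n)
y_pred = random_feature_regression(X_tr, y_tr, X_te)
print(np.mean((y_pred - X_te @ beta) ** 2))   # test error against the noiseless teacher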
Generalization Error of Generalized Linear Models in High Dimensions
TLDR
This work provides a general framework to characterize the asymptotic generalization error for single-layer neural networks (i.e., generalized linear models) with arbitrary non-linearities, making it applicable to regression as well as classification problems.
A Random Matrix Perspective on Mixtures of Nonlinearities for Deep Learning
TLDR
Intriguingly, it is found that a mixture of nonlinearities can outperform the best single nonlinearity on the noisy autoencoding task, suggesting that mixtures of nonlinearities might be useful for approximate kernel methods or neural network architecture design.
Understanding Generalization in Recurrent Neural Networks
TLDR
This work proposes adding random noise to the input data and proves a generalization bound for training with random noise that extends the former bound; it also discovers that the Fisher-Rao norm for RNNs can be interpreted as a measure of the gradient, and that incorporating this gradient measure tightens the bound.
...

References

SHOWING 1-10 OF 76 REFERENCES
Generalization of Two-layer Neural Networks: An Asymptotic Viewpoint
TLDR
This work derives the exact population risk of unregularized least-squares regression with two-layer neural networks, when either the first or the second layer is trained using a gradient flow, under different initialization setups.
Wide Neural Networks of Any Depth Evolve as Linear Models Under Gradient Descent
TLDR
This work shows that for wide NNs the learning dynamics simplify considerably and that, in the infinite width limit, they are governed by a linear model obtained from the first-order Taylor expansion of the network around its initial parameters.
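The following NumPy sketch (an illustration, not the paper's code) writes out that first-order Taylor expansion for a two-layer tanh network and compares it to the full network after a small parameter change; the tanh activation, the 1/sqrt(m) output scaling, and the perturbation size are illustrative assumptions.

import numpy as np

def net(params, X):
    # f(x) = a . tanh(W x) / sqrt(m): a width-m two-layer network (tanh is an illustrative choice).
    W, a = params
    return np.tanh(X @ W.T) @ a / np.sqrt(a.shape[0])

def net_linearized(params0, params, X):
    # First-order Taylor expansion of f in the parameters around params0.
    W0, a0 = params0
    W, a = params
    m = a0.shape[0]
    pre = X @ W0.T                     # pre-activations at initialization, shape (n, m)
    act = np.tanh(pre)
    f0 = act @ a0 / np.sqrt(m)
    # df/da_i = tanh(w_i . x) / sqrt(m);  df/dW_i = a_i * (1 - tanh^2(w_i . x)) * x / sqrt(m)
    grad_a_term = act @ (a - a0) / np.sqrt(m)
    grad_W_term = (((1.0 - act ** 2) * (X @ (W - W0).T)) @ a0) / np.sqrt(m)
    return f0 + grad_a_term + grad_W_term

rng = np.random.default_rng(0)
d, m = 10, 4096                        # wide hidden layer
X = rng.standard_normal((4, d))
W0, a0 = rng.standard_normal((m, d)), rng.standard_normal(m)
# Small parameter change, standing in for a few steps of gradient descent.
W1 = W0 + 1e-2 * rng.standard_normal((m, d))
a1 = a0 + 1e-2 * rng.standard_normal(m)
print(net((W1, a1), X))
print(net_linearized((W0, a0), (W1, a1), X))   # close to the full network for small parameter changes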
Temporal Evolution of Generalization during Learning in Linear Networks
TLDR
It is shown that the behavior of the validation function depends critically on the initial conditions and on the characteristics of the noise, and that under certain simple assumptions, if the initial weights are sufficiently small, the validation function has a unique minimum corresponding to an optimal stopping time for training.
Generalization Dynamics in LMS Trained Linear Networks
TLDR
For a speech labeling task, predicted weaving effects were qualitatively tested and observed by computer simulations in networks trained by the linear and non-linear back-propagation algorithm.
On Lazy Training in Differentiable Programming
TLDR
This work shows that the "lazy training" phenomenon is not specific to over-parameterized neural networks and is due to a choice of scaling that makes the model behave as its linearization around the initialization, thus yielding a model equivalent to learning with positive-definite kernels.
Exponentially vanishing sub-optimal local minima in multilayer neural networks
TLDR
It is proved that, with high probability in the limit of $N\rightarrow\infty$ datapoints, the volume of differentiable regions of the empirical loss containing sub-optimal differentiable local minima is exponentially vanishing in comparison with the same volume of global minima.
Understanding deep learning requires rethinking generalization
TLDR
These experiments establish that state-of-the-art convolutional networks for image classification trained with stochastic gradient methods easily fit a random labeling of the training data, and confirm that simple depth two neural networks already have perfect finite sample expressivity.
Nonlinear random matrix theory for deep learning
TLDR
This work demonstrates that the pointwise nonlinearities typically applied in neural networks can be incorporated into a standard method of proof in random matrix theory known as the moments method, and identifies an intriguing new class of activation functions with favorable properties.
Effect of Batch Learning in Multilayer Neural Networks
TLDR
An experimental study on multilayer perceptrons and linear neural networks (LNNs) shows that batch learning induces strong overtraining in both models in overrealizable cases, which means that the degradation of the generalization error caused by surplus units can be alleviated.
...