Deep double descent: where bigger models and more data hurt

@article{Nakkiran2019DeepDD,
  title={Deep double descent: where bigger models and more data hurt},
  author={Preetum Nakkiran and Gal Kaplun and Yamini Bansal and Tristan Yang and Boaz Barak and Ilya Sutskever},
  journal={Journal of Statistical Mechanics: Theory and Experiment},
  year={2021},
  volume={2021}
}
We show that a variety of modern deep learning tasks exhibit a ‘double-descent’ phenomenon where, as we increase model size, performance first gets worse and then gets better. Moreover, we show that double descent occurs not just as a function of model size, but also as a function of the number of training epochs. We unify the above phenomena by defining a new complexity measure we call the effective model complexity and conjecture a generalized double descent with respect to this measure… 
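
As a toy illustration of the model-wise curve described in the abstract (the paper's own experiments use convolutional networks and Transformers), the sketch below sweeps the width of a random-ReLU-feature regression past the interpolation threshold; the data model, feature map, and all constants are illustrative assumptions.

```python
# Toy model-wise double descent with random ReLU features and a
# minimum-norm least-squares fit (illustrative; not the paper's setup).
import numpy as np

rng = np.random.default_rng(0)
n_train, n_test, d, noise = 40, 1000, 10, 0.5

w_true = rng.normal(size=d)
X_tr = rng.normal(size=(n_train, d))
X_te = rng.normal(size=(n_test, d))
y_tr = X_tr @ w_true + noise * rng.normal(size=n_train)
y_te = X_te @ w_true

for p in [5, 10, 20, 40, 80, 160, 640]:          # number of random features
    V = rng.normal(size=(d, p)) / np.sqrt(d)     # fixed random projection
    phi_tr = np.maximum(X_tr @ V, 0.0)           # ReLU features, train
    phi_te = np.maximum(X_te @ V, 0.0)           # ReLU features, test
    beta = np.linalg.pinv(phi_tr) @ y_tr         # min-norm interpolator once p >= n_train
    test_mse = np.mean((phi_te @ beta - y_te) ** 2)
    print(f"p = {p:4d}   test MSE = {test_mse:.3f}")
```

In runs of this kind the test error typically rises as p approaches n_train and descends again for much wider feature maps, mirroring the double-descent shape.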

Early Stopping in Deep Networks: Double Descent and How to Eliminate it

Inspired by this theory, two standard convolutional networks are studied empirically, and it is shown that eliminating epoch-wise double descent by adjusting the stepsizes of different layers significantly improves early-stopping performance.
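
As a rough sketch of what per-layer stepsizes can look like in practice, the snippet below uses PyTorch optimizer parameter groups to give different layers different learning rates; the toy model and the specific rates are illustrative assumptions, not the schedule proposed in the paper.

```python
import torch.nn as nn
import torch.optim as optim

# Toy network; indices 0 and 3 are the conv and linear layers below.
model = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1), nn.ReLU(),
    nn.Flatten(),
    nn.Linear(16 * 32 * 32, 10),
)

# Parameter groups assign a different stepsize to each layer.
optimizer = optim.SGD(
    [
        {"params": model[0].parameters(), "lr": 1e-2},  # early convolutional layer
        {"params": model[3].parameters(), "lr": 1e-3},  # final linear layer
    ],
    momentum=0.9,
)
```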

When and how epochwise double descent happens

This work develops an analytically tractable model of epochwise double descent that allows us to characterise theoretically when this effect is likely to occur and shows experimentally that deep neural networks behave similarly to the theoretical model.

Mitigating Deep Double Descent by Concatenating Inputs

This work proposes a construction that augments the existing dataset by artificially increasing the number of samples, and empirically mitigates the double descent curve in this setting.

Multi-scale Feature Learning Dynamics: Insights for Double Descent

This work investigates the origins of the less studied epoch-wise double descent, in which the test error undergoes two non-monotonic transitions, or descents, as the training time increases, and derives closed-form analytical expressions describing the generalization error in terms of low-dimensional scalar macroscopic variables.

Understanding the double descent phenomenon

This lecture explains the concept of double descent introduced in [4] and its mechanisms, and introduces the inductive biases that appear to play a key role in double descent by selecting, among the multiple interpolating solutions, a smooth empirical risk minimizer.

Comprehensive Understanding of Double Descent

We focus on the phenomenon of double descent in deep learning, wherein as we increase model size or the number of training epochs, performance on the test set initially improves (as expected), then worsens, and finally improves again.

Sparse Double Descent: Where Network Pruning Aggravates Overfitting

A novel learning-distance interpretation is proposed: the curve of the ℓ2 learning distance of sparse models (from initialized parameters to final parameters) may correlate well with the sparse double descent curve and reflect generalization better than minima flatness.
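
A minimal sketch of such an ℓ2 learning distance, assuming it is the Euclidean distance between the flattened initial and final parameter vectors of the same network (the tiny model here is a placeholder):

```python
import copy
import torch
import torch.nn as nn

def l2_learning_distance(init_state, final_state):
    # Euclidean distance between two parameter snapshots of the same model.
    diffs = [(final_state[k].float() - init_state[k].float()).flatten() for k in init_state]
    return torch.cat(diffs).norm().item()

model = nn.Linear(10, 2)                        # stand-in for a (possibly pruned) network
init_state = copy.deepcopy(model.state_dict())
# ... train (and prune) the model here ...
print(l2_learning_distance(init_state, model.state_dict()))
```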

VC Theoretical Explanation of Double Descent

This paper presents a VC-theoretical analysis of double descent, shows that it can be fully explained by classical VC generalization bounds, and illustrates an application of analytic VC bounds to modeling double descent in classification problems, using empirical results for several learning methods.
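
For orientation, one standard form of the classical VC generalization bound for binary classification (exact constants differ between formulations, so this is indicative rather than the bound used in the paper): with probability at least $1-\eta$ over a sample of size $n$, a classifier $f$ from a class of VC dimension $h$ satisfies

$$ R(f) \le R_{\mathrm{emp}}(f) + \sqrt{\frac{h\left(\ln\frac{2n}{h} + 1\right) - \ln\frac{\eta}{4}}{n}}. $$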

Rethinking Parameter Counting in Deep Models: Effective Dimensionality Revisited

Effective dimensionality is related to posterior contraction in Bayesian deep learning, model selection, width-depth tradeoffs, double descent, and functional diversity in loss surfaces, leading to a richer understanding of the interplay between parameters and functions in deep models.
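
A minimal sketch of one common formulation of effective dimensionality, namely a soft count of the Hessian eigenvalues that are large relative to a regularization scale z (the eigenvalues and z below are illustrative assumptions):

```python
import numpy as np

def effective_dimensionality(eigenvalues, z=1.0):
    # Soft count of eigenvalues that dominate the scale z:
    # N_eff = sum_i lam_i / (lam_i + z).
    lam = np.clip(np.asarray(eigenvalues, dtype=float), 0.0, None)
    return float(np.sum(lam / (lam + z)))

# Two dominant directions out of four -> N_eff close to 2.
print(effective_dimensionality([100.0, 10.0, 0.1, 0.001], z=1.0))
```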

Double Descent Optimization Pattern and Aliasing: Caveats of Noisy Labels

It is shown that noisy labels must be present in both the training and generalization sets for a double descent pattern to be observed, that the learning rate influences double descent, and how different optimizers and optimizer parameters affect its appearance.
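
A minimal sketch of the kind of symmetric label-noise injection such studies rely on; the helper name, noise fraction, and class count are illustrative assumptions:

```python
import numpy as np

def flip_labels(y, noise_fraction, num_classes, rng):
    # Replace a random subset of labels with labels drawn uniformly at random.
    y_noisy = y.copy()
    idx = rng.choice(len(y), size=int(noise_fraction * len(y)), replace=False)
    y_noisy[idx] = rng.integers(0, num_classes, size=len(idx))
    return y_noisy

rng = np.random.default_rng(0)
y_clean = rng.integers(0, 10, size=1000)
y_noisy = flip_labels(y_clean, noise_fraction=0.2, num_classes=10, rng=rng)
print((y_noisy != y_clean).mean())  # about 0.18: some resampled labels match the originals
```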
...

References

SGD on Neural Networks Learns Functions of Increasing Complexity

Key to the work is a new measure of how well one classifier explains the performance of another, based on conditional mutual information, which can be helpful in explaining why SGD-learned classifiers tend to generalize well even in the over-parameterized regime.

Scaling description of generalization with number of parameters in deep learning

This work relies on the so-called Neural Tangent Kernel, which connects large neural nets to kernel methods, to show that the initialization causes finite-size random fluctuations of the neural net output function f_N around its expectation, which affect the generalization error for classification.

Understanding deep learning requires rethinking generalization

These experiments establish that state-of-the-art convolutional networks for image classification trained with stochastic gradient methods easily fit a random labeling of the training data, and confirm that simple depth two neural networks already have perfect finite sample expressivity.

Two models of double descent for weak features

The "double descent" risk curve was recently proposed to qualitatively describe the out-of-sample prediction accuracy of variably-parameterized machine learning models and it is shown that the risk peaks when the number of features is close to the sample size, but also that therisk decreases towards its minimum as $p$ increases beyond $n$.

The Generalization Error of Random Features Regression: Precise Asymptotics and the Double Descent Curve

Deep learning methods operate in regimes that defy the traditional statistical mindset, and are so rich that they can interpolate the observed labels, even if the latter are replaced by pure noise.

Understanding overfitting peaks in generalization error: Analytical risk curves for l2 and l1 penalized interpolation

A generative and fitting model pair is introduced, and it is shown that the overfitting peak can be dissociated from the point at which the fitting function gains enough degrees of freedom to match the data-generating model and thus provide good generalization.

Benign overfitting in linear regression

A characterization of linear regression problems for which the minimum norm interpolating prediction rule has near-optimal prediction accuracy shows that overparameterization is essential for benign overfitting in this setting: the number of directions in parameter space that are unimportant for prediction must significantly exceed the sample size.
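
A minimal sketch of the minimum-norm interpolating rule in the overparameterized regime ($p > n$), computed with the pseudoinverse; the dimensions and data model are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 50, 500                              # far more features than samples
X = rng.normal(size=(n, p))
y = X[:, 0] + 0.1 * rng.normal(size=n)      # signal in one direction plus noise

theta_hat = np.linalg.pinv(X) @ y           # minimum-l2-norm solution among all interpolators
print(np.abs(X @ theta_hat - y).max())      # ~1e-13: the training data is fit exactly
```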

The Dipping Phenomenon

It is demonstrated that there are classification problems on which particular classifiers attain their optimum performance at a finite training set size, and that whether this phenomenon can be observed depends on the choice of classifier in relation to the underlying class distributions.

Reconciling modern machine learning and the bias-variance trade-off

A new "double descent" risk curve is exhibited that extends the traditional U-shaped bias-variance curve beyond the point of interpolation and shows that the risk of suitably chosen interpolating predictors from these models can, in fact, be decreasing as the model complexity increases, often below the risk achieved using non-interpolating models.
...