The Generalization Error of Random Features Regression: Precise Asymptotics and the Double Descent Curve

@article{Mei2019TheGE,
  title={The Generalization Error of Random Features Regression: Precise Asymptotics and the Double Descent Curve},
  author={Song Mei and Andrea Montanari},
  journal={arXiv: Statistics Theory},
  year={2019}
}
Deep learning methods operate in regimes that defy the traditional statistical mindset. Neural network architectures often contain more parameters than training samples, and are so rich that they can interpolate the observed labels, even if the latter are replaced by pure noise. Despite their huge complexity, the same architectures achieve small generalization error on real data. This phenomenon has been rationalized in terms of a so-called 'double descent' curve. As the model complexity…
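
For intuition, the setting named in the title can be reproduced numerically in a few lines. The sketch below is my own illustrative code, not the authors': it fits ridge regression on N random ReLU features of data on the sphere and sweeps N through the interpolation threshold N = n. The activation, the linear target function, the dimensions, and the ridge penalty are all arbitrary choices made for illustration.

import numpy as np

rng = np.random.default_rng(0)
d, n, n_test = 30, 300, 2000                   # input dimension, train/test sample sizes
noise, ridge = 0.1, 1e-6                       # label noise level, (near-zero) ridge penalty
beta = rng.standard_normal(d) / np.sqrt(d)     # illustrative linear target function

def sample(m):
    # x uniform on the sphere of radius sqrt(d); y = <beta, x> + Gaussian noise
    X = rng.standard_normal((m, d))
    X *= np.sqrt(d) / np.linalg.norm(X, axis=1, keepdims=True)
    return X, X @ beta + noise * rng.standard_normal(m)

X_tr, y_tr = sample(n)
X_te, y_te = sample(n_test)

for N in [50, 100, 200, 290, 300, 310, 400, 800, 3000]:
    W = rng.standard_normal((d, N)) / np.sqrt(d)                        # random first-layer weights
    Z_tr, Z_te = np.maximum(X_tr @ W, 0.0), np.maximum(X_te @ W, 0.0)   # ReLU random features
    # ridge regression on the random features (only second-layer weights are trained)
    a = np.linalg.solve(Z_tr.T @ Z_tr + ridge * np.eye(N), Z_tr.T @ y_tr)
    print(f"N/n = {N / n:5.2f}   test MSE = {np.mean((Z_te @ a - y_te) ** 2):.3f}")

With the near-zero ridge penalty, the test error typically spikes around N/n = 1 and decreases again for N/n >> 1, which is the double descent shape the paper characterizes exactly in the proportional limit N, n, d -> infinity.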

Citations

Double Trouble in Double Descent: Bias and Variance(s) in the Lazy Regime
TLDR
A quantitative theory for the double descent of test error in the so-called lazy learning regime of neural networks is developed by considering the problem of learning a high-dimensional function with random features regression, and it is shown that the bias displays a phase transition at the interpolation threshold, beyond which it remains constant.
The Neural Tangent Kernel in High Dimensions: Triple Descent and a Multi-Scale Theory of Generalization
TLDR
This work provides a precise high-dimensional asymptotic analysis of generalization under kernel regression with the Neural Tangent Kernel, which characterizes the behavior of wide neural networks optimized with gradient descent.
The Interpolation Phase Transition in Neural Networks: Memorization and Generalization under Lazy Training
TLDR
In the context of two-layer neural networks in the neural tangent (NT) regime, it is shown that the network approximately performs ridge regression in the raw features, with a strictly positive 'self-induced' regularization.
Towards an Understanding of Benign Overfitting in Neural Networks
TLDR
It is shown that it is possible for the two-layer ReLU network interpolator to achieve a near minimax-optimal learning rate, which to the authors' knowledge is the first generalization result for such networks.
Empirical Risk Minimization in the Interpolating Regime with Application to Neural Network Learning
TLDR
This work shows that for certain large hypothesis classes, some interpolating ERMs enjoy very good statistical guarantees while others fail in the worst sense, and that the same phenomenon occurs for DNNs with zero training error and sufficiently large architectures.
A Farewell to the Bias-Variance Tradeoff? An Overview of the Theory of Overparameterized Machine Learning
The rapid recent progress in machine learning (ML) has raised a number of scientific questions that challenge the longstanding dogma of the field. One of the most important riddles is the good…
Understanding Double Descent Requires a Fine-Grained Bias-Variance Decomposition
TLDR
This work describes an interpretable, symmetric decomposition of the variance into terms associated with the randomness from sampling, initialization, and the labels, computes the high-dimensional asymptotic behavior of this decomposition for random feature kernel regression, and analyzes the strikingly rich phenomenology that arises.
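
Schematically, the decomposition referred to above splits the expected test error of the random features predictor over the three sources of randomness. In notation of my own choosing (the grouping below is illustrative, not taken from the cited paper):

    E[test error] = Bias^2 + V_sample + V_init + V_label + (interaction terms),

where the three labelled variance terms are attributed to the randomness of the training sample, of the random feature initialization, and of the label noise, and the interaction terms collect variance that cannot be assigned to a single source.
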
Dimensionality reduction, regularization, and generalization in overparameterized regressions
TLDR
It is shown that OLS is arbitrarily susceptible to data-poisoning attacks in the overparameterized regime, unlike in the underparameterized regime, and that regularization and dimensionality reduction improve robustness.
Generalization Error of Generalized Linear Models in High Dimensions
TLDR
This work provides a general framework to characterize the asymptotic generalization error for single-layer neural networks (i.e., generalized linear models) with arbitrary non-linearities, making it applicable to regression as well as classification problems.
Benign overfitting in ridge regression
Classical learning theory suggests that strong regularization is needed to learn a class with large complexity. This intuition is in contrast with the modern practice of machine learning, in…

References

SHOWING 1-10 OF 86 REFERENCES
Toward Moderate Overparameterization: Global Convergence Guarantees for Training Shallow Neural Networks
TLDR
Focusing on shallow neural nets and smooth activations, it is shown that (stochastic) gradient descent, when initialized at random, converges at a geometric rate to a nearby global optimum as soon as the square root of the number of network parameters exceeds the size of the training data.
High-dimensional dynamics of generalization error in neural networks
TLDR
It is found that the dynamics of gradient descent learning naturally protect against overtraining and overfitting in large networks, and that standard applications of theories such as Rademacher complexity are inaccurate in predicting the generalization performance of deep neural networks.
To understand deep learning we need to understand kernel learning
TLDR
It is argued that progress on understanding deep learning will be difficult until more tractable "shallow" kernel methods are better understood, and that new theoretical ideas are needed for understanding the properties of classical kernel methods.
Reconciling modern machine learning and the bias-variance trade-off
TLDR
A new "double descent" risk curve is exhibited that extends the traditional U-shaped bias-variance curve beyond the point of interpolation and shows that the risk of suitably chosen interpolating predictors from these models can, in fact, be decreasing as the model complexity increases, often below the risk achieved using non-interpolating models. Expand
On Lazy Training in Differentiable Programming
TLDR
This work shows that this "lazy training" phenomenon is not specific to over-parameterized neural networks, and is due to a choice of scaling that makes the model behave as its linearization around the initialization, thus yielding a model equivalent to learning with positive-definite kernels.
Harmless Interpolation of Noisy Data in Regression
TLDR
It is shown that the fundamental generalization (mean-squared) error of any interpolating solution in the presence of noise decays to zero with the number of features, and that overparameterization can be beneficial in ensuring harmless interpolation of noise.
Understanding deep learning requires rethinking generalization
TLDR
These experiments establish that state-of-the-art convolutional networks for image classification trained with stochastic gradient methods easily fit a random labeling of the training data, and confirm that simple depth-two neural networks already have perfect finite sample expressivity.
A Convergence Theory for Deep Learning via Over-Parameterization
TLDR
This work proves why stochastic gradient descent can find global minima on the training objective of DNNs in polynomial time, and implies an equivalence between over-parameterized neural networks and the neural tangent kernel (NTK) in the finite (and polynomial) width setting.
Learning and Generalization in Overparameterized Neural Networks, Going Beyond Two Layers
TLDR
It is proved that overparameterized neural networks can learn some notable concept classes, including those represented by two- and three-layer networks with fewer parameters and smooth activations, using SGD (stochastic gradient descent) or its variants in polynomial time with polynomially many samples.
Gradient Descent Provably Optimizes Over-parameterized Neural Networks
TLDR
Over-parameterization and random initialization jointly restrict every weight vector to be close to its initialization for all iterations, which allows a strong convexity-like property to be exploited, showing that gradient descent converges at a global linear rate to the global optimum.