# Understanding overfitting peaks in generalization error: Analytical risk curves for l2 and l1 penalized interpolation

```bibtex
@article{Mitra2019UnderstandingOP,
  title   = {Understanding overfitting peaks in generalization error: Analytical risk curves for l2 and l1 penalized interpolation},
  author  = {P. Mitra},
  journal = {ArXiv},
  year    = {2019},
  volume  = {abs/1906.03667}
}
```

Traditionally in regression one minimizes the number of fitting parameters or uses smoothing/regularization to trade off training error (TE) against generalization error (GE). Driving TE to zero by increasing the fitting degrees of freedom (dof) is expected to increase GE. However, modern big-data approaches, including deep nets, over-parametrize and drive TE to zero (data interpolation) without degrading GE. Overparametrization has the benefit that global minima of the empirical loss function proliferate…

#### 35 Citations

A Farewell to the Bias-Variance Tradeoff? An Overview of the Theory of Overparameterized Machine Learning

- Mathematics, Computer Science
- 2021

The rapid recent progress in machine learning (ML) has raised a number of scientific questions that challenge the longstanding dogma of the field. One of the most important riddles is the good…

The Neural Tangent Kernel in High Dimensions: Triple Descent and a Multi-Scale Theory of Generalization

- Computer Science, Mathematics
- ICML
- 2020

This work provides a precise high-dimensional asymptotic analysis of generalization under kernel regression with the Neural Tangent Kernel, which characterizes the behavior of wide neural networks optimized with gradient descent.

Overfitting Can Be Harmless for Basis Pursuit: Only to a Degree

- Computer Science, Mathematics
- ArXiv
- 2020

This is the first result in the literature showing that, without any explicit regularization, the test error of a practical-to-compute overfitting solution can exhibit double descent and approach the order of the noise level, independently of the null risk.

Harmless Interpolation of Noisy Data in Regression

- Computer Science
- IEEE Journal on Selected Areas in Information Theory
- 2020

It is shown that the fundamental generalization (mean-squared) error of any interpolating solution in the presence of noise decays to zero with the number of features, and that overparameterization can be beneficial in ensuring harmless interpolation of noise.

Benign overfitting in ridge regression

- Mathematics
- 2020

Classical learning theory suggests that strong regularization is needed to learn a class with large complexity. This intuition is in contrast with the modern practice of machine learning, in…

Fitting Elephants

- Computer Science, Biology
- ArXiv
- 2021

This article elucidates Statistically Consistent Interpolation (SCI) using the weighted interpolating nearest neighbors (wiNN) algorithm, which adds singular weight functions to kNN (k-nearest neighbors), and shows that data interpolation can be a valid ML strategy for big data.

Fitting elephants in modern machine learning by statistically consistent interpolation

- Computer Science
- 2021

This work elucidates statistically consistent interpolation (SCI) using the weighted interpolating nearest neighbours algorithm, which adds singular weight functions to k nearest neighbours, and discusses how SCI elucidates the differing approaches to modelling natural phenomena represented in modern machine learning, traditional physical theory and biological brains.

Harmless interpolation of noisy data in regression

- Computer Science, Mathematics
- 2019 IEEE International Symposium on Information Theory (ISIT)
- 2019

A bound on how well such interpolative solutions can generalize to fresh test data is given, and it is shown that this bound generically decays to zero with the number of extra features, thus characterizing an explicit benefit of overparameterization.

Double Descent Optimization Pattern and Aliasing: Caveats of Noisy Labels

- Computer Science
- ArXiv
- 2021

It is shown that noisy labels must be present in both the training and generalization sets to observe a double descent pattern, that the learning rate influences double descent, and how different optimizers and optimizer parameters affect the appearance of double descent is studied.

Understanding Double Descent Requires a Fine-Grained Bias-Variance Decomposition

- Mathematics, Computer Science
- NeurIPS
- 2020

This work describes an interpretable, symmetric decomposition of the variance into terms associated with the randomness from sampling, initialization, and the labels, computes the high-dimensional asymptotic behavior of this decomposition for random feature kernel regression, and analyzes the strikingly rich phenomenology that arises.

#### References

Showing 1–10 of 28 references

The Power of Interpolation: Understanding the Effectiveness of SGD in Modern Over-parametrized Learning

- Computer Science, Mathematics
- ICML
- 2018

The key observation is that most modern learning architectures are over-parametrized and are trained to interpolate the data by driving the empirical loss close to zero, yet it is still unclear why these interpolated solutions perform well on test data.

Overfitting or perfect fitting? Risk bounds for classification and regression rules that interpolate

- Mathematics, Physics
- NeurIPS
- 2018

A theoretical foundation for interpolated classifiers is laid by analyzing local interpolating schemes, including a geometric simplicial interpolation algorithm and singularly weighted $k$-nearest neighbor schemes, and consistency or near-consistency is proved for these schemes in classification and regression problems.

To understand deep learning we need to understand kernel learning

- Computer Science, Mathematics
- ICML
- 2018

It is argued that progress on understanding deep learning will be difficult until more tractable "shallow" kernel methods are better understood, and that new theoretical ideas are needed to understand the properties of classical kernel methods.

Reconciling modern machine learning and the bias-variance trade-off

- Computer Science, Mathematics
- ArXiv
- 2018

A new "double descent" risk curve is exhibited that extends the traditional U-shaped bias-variance curve beyond the point of interpolation and shows that the risk of suitably chosen interpolating predictors from these models can, in fact, be decreasing as the model complexity increases, often below the risk achieved using non-interpolating models.

Fast Convergence for Stochastic and Distributed Gradient Descent in the Interpolation Limit

- Mathematics, Computer Science
- 2018 26th European Signal Processing Conference (EUSIPCO)
- 2018

In contrast with previous usage of similar penalty functions to enforce consensus between nodes, in the interpolating limit it is not required to take the penalty parameter to infinity for consensus to occur, which reinforces the utility of the interpolation limit in the theoretical treatment of modern machine learning algorithms.

Understanding deep learning requires rethinking generalization

- Computer Science
- ICLR
- 2017

These experiments establish that state-of-the-art convolutional networks for image classification trained with stochastic gradient methods easily fit a random labeling of the training data, and confirm that simple depth-two neural networks already have perfect finite sample expressivity.

Reconciling modern machine learning practice and the bias-variance trade-off

- Computer Science
- 2018

This paper reconciles the classical understanding and the modern practice within a unified performance curve that subsumes the textbook U-shaped bias-variance trade-off curve by showing how increasing model capacity beyond the point of interpolation results in improved performance.

The jamming transition as a paradigm to understand the loss landscape of deep neural networks

- Computer Science, Medicine
- Physical review. E
- 2019

It is argued that in fully connected deep networks a phase transition delimits the over- and underparametrized regimes where fitting can or cannot be achieved, and it is observed that the ability of fully connected networks to fit random data is independent of their depth, an independence that appears to also hold for real data.

Just Interpolate: Kernel "Ridgeless" Regression Can Generalize

- Mathematics, Computer Science
- ArXiv
- 2018

This work isolates a phenomenon of implicit regularization for minimum-norm interpolated solutions which is due to a combination of high dimensionality of the input data, curvature of the kernel function, and favorable geometric properties of the data such as an eigenvalue decay of the empirical covariance and kernel matrices.

High-dimensional dynamics of generalization error in neural networks

- Computer Science, Mathematics
- Neural Networks
- 2020

It is found that the dynamics of gradient descent learning naturally protect against overtraining and overfitting in large networks, and that standard application of theories such as Rademacher complexity is inaccurate in predicting the generalization performance of deep neural networks.