Corpus ID: 182953045

Understanding overfitting peaks in generalization error: Analytical risk curves for l2 and l1 penalized interpolation

@article{Mitra2019UnderstandingOP,
  title={Understanding overfitting peaks in generalization error: Analytical risk curves for l2 and l1 penalized interpolation},
  author={P. Mitra},
  journal={ArXiv},
  year={2019},
  volume={abs/1906.03667}
}
  • P. Mitra
  • Published 2019
  • Computer Science, Physics, Mathematics
  • ArXiv
Traditionally in regression one minimizes the number of fitting parameters or uses smoothing/regularization to trade off training error (TE) and generalization error (GE). Driving TE to zero by increasing the fitting degrees of freedom (dof) is expected to increase GE. However, modern big-data approaches, including deep nets, seem to over-parametrize and send TE to zero (data interpolation) without impacting GE. Overparametrization has the benefit that global minima of the empirical loss function proliferate…
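The abstract describes risk curves whose generalization error peaks at the interpolation threshold. Purely as an illustrative sketch (not the paper's analytical setup), the Python snippet below reproduces that qualitative overfitting peak for minimum-l2-norm ("ridgeless") least-squares interpolation; the sample sizes, noise level, and feature-subset construction are arbitrary choices for the demo.

```python
# Illustrative sketch: the overfitting peak in generalization error for
# minimum-l2-norm ("ridgeless") least-squares fits, as the number of fitted
# features p crosses the number of training samples n. All sizes and the
# noise level are arbitrary demo choices, not the paper's analytical model.
import numpy as np

rng = np.random.default_rng(0)
n_train, n_test, d_total, sigma = 50, 2000, 200, 0.5

# Ground-truth linear model over d_total latent features.
beta = rng.normal(size=d_total) / np.sqrt(d_total)
X_train = rng.normal(size=(n_train, d_total))
X_test = rng.normal(size=(n_test, d_total))
y_train = X_train @ beta + sigma * rng.normal(size=n_train)
y_test = X_test @ beta + sigma * rng.normal(size=n_test)

for p in [10, 25, 40, 50, 60, 100, 200]:   # p < n: classical regime; p >= n: interpolation
    # Minimum-norm least-squares fit using only the first p features;
    # for p >= n_train this interpolates the training data (TE ~ 0).
    b_hat = np.linalg.pinv(X_train[:, :p]) @ y_train
    te = np.mean((X_train[:, :p] @ b_hat - y_train) ** 2)
    ge = np.mean((X_test[:, :p] @ b_hat - y_test) ** 2)
    print(f"p={p:4d}  train MSE={te:.3f}  test MSE={ge:.3f}")

# The test MSE typically spikes near p = n_train (the interpolation threshold)
# and comes back down as p grows further -- the overfitting peak in the GE curve.
```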
A Farewell to the Bias-Variance Tradeoff? An Overview of the Theory of Overparameterized Machine Learning
The rapid recent progress in machine learning (ML) has raised a number of scientific questions that challenge the longstanding dogma of the field. One of the most important riddles is the good…
The Neural Tangent Kernel in High Dimensions: Triple Descent and a Multi-Scale Theory of Generalization
TLDR
This work provides a precise high-dimensional asymptotic analysis of generalization under kernel regression with the Neural Tangent Kernel, which characterizes the behavior of wide neural networks optimized with gradient descent.
Overfitting Can Be Harmless for Basis Pursuit: Only to a Degree
TLDR
To the best of our knowledge, this is the first result in the literature showing that, without any explicit regularization, the test errors of a practical-to-compute overfitting solution can exhibit double descent and approach the order of the noise level independently of the null risk.
Harmless Interpolation of Noisy Data in Regression
TLDR
It is shown that the fundamental generalization (mean-squared) error of any interpolating solution in the presence of noise decays to zero with the number of features, and overparameterization can be beneficial in ensuring harmless interpolation of noise.
Benign overfitting in ridge regression
Classical learning theory suggests that strong regularization is needed to learn a class with large complexity. This intuition is in contrast with the modern practice of machine learning, in…
Fitting Elephants
  • P. Mitra
  • Computer Science, Biology
  • ArXiv
  • 2021
TLDR
This article elucidates Statistically Consistent Interpolation (SCI) using the weighted interpolating nearest neighbors (wiNN) algorithm, which adds singular weight functions to kNN (k-nearest neighbors), and shows that data interpolation can be a valid ML strategy for big data.
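The wiNN scheme summarized above interpolates because its neighbor weights diverge as the query point approaches a training point. The sketch below shows one assumed form of such a singularly weighted k-nearest-neighbor regressor (inverse-power-distance weights); the particular weight function, k, and exponent are illustrative guesses, not the exact construction of the cited work.

```python
# Minimal sketch (assumed form, not the cited paper's exact construction) of a
# singularly weighted k-nearest-neighbor interpolator: the weights diverge at
# the training points, so the fit passes through the training data exactly.
import numpy as np

def winn_predict(X_train, y_train, X_query, k=5, delta=2.0, eps=1e-12):
    """Predict at X_query using inverse-power-distance weights over k neighbors."""
    preds = np.empty(len(X_query))
    for i, x in enumerate(X_query):
        dist = np.linalg.norm(X_train - x, axis=1)
        nn = np.argsort(dist)[:k]              # indices of the k nearest neighbors
        if dist[nn[0]] < eps:                  # query coincides with a training point:
            preds[i] = y_train[nn[0]]          # interpolation forces the training label
        else:
            w = dist[nn] ** (-delta)           # singular weights, diverging as dist -> 0
            preds[i] = np.dot(w, y_train[nn]) / w.sum()
    return preds

# Tiny usage example on noisy 1-D data.
rng = np.random.default_rng(1)
X = rng.uniform(0, 1, size=(40, 1))
y = np.sin(4 * X[:, 0]) + 0.2 * rng.normal(size=40)
print(winn_predict(X, y, X[:3]))               # reproduces the (noisy) training labels
print(winn_predict(X, y, np.array([[0.5]])))   # weighted-average prediction between points
```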
Fitting elephants in modern machine learning by statistically consistent interpolation
TLDR
This work elucidates statistically consistent interpolation (SCI) using the weighted interpolating nearest neighbours algorithm, which adds singular weight functions to k nearest neighbours, and discusses how SCI clarifies the differing approaches to modelling natural phenomena represented in modern machine learning, traditional physical theory and biological brains.
Harmless interpolation of noisy data in regression
TLDR
A bound on how well such interpolative solutions can generalize to fresh test data is given, and it is shown that this bound generically decays to zero with the number of extra features, thus characterizing an explicit benefit of overparameterization.
Double Descent Optimization Pattern and Aliasing: Caveats of Noisy Labels
TLDR
It is shown that noisy labels must be present in both the training and generalization sets to observe a double descent pattern and that the learning rate influences double descent; how different optimizers and optimizer parameters influence the appearance of double descent is also studied.
Understanding Double Descent Requires a Fine-Grained Bias-Variance Decomposition
TLDR
This work describes an interpretable, symmetric decomposition of the variance into terms associated with the randomness from sampling, initialization, and the labels, computes the high-dimensional asymptotic behavior of this decomposition for random feature kernel regression, and analyzes the strikingly rich phenomenology that arises.
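The fine-grained decomposition referred to above splits the prediction variance by its sources of randomness. As a schematic reminder only (not the cited paper's exact formulas), the classical pointwise bias-variance decomposition, with the variance further split over the randomness sources named in the summary (data sampling D, initialization I, label noise N), can be written as:

```latex
% Schematic only: classical pointwise bias-variance decomposition of test risk,
% with the total variance split (symbolically) over nonempty subsets of the
% randomness sources: data sampling (D), initialization (I), label noise (N).
\begin{align}
  \mathbb{E}\big[(\hat f(x) - y)^2\big]
    &= \underbrace{\sigma^2}_{\text{noise}}
     + \underbrace{\big(\mathbb{E}[\hat f(x)] - f^*(x)\big)^2}_{\text{bias}^2}
     + \underbrace{\operatorname{Var}\big[\hat f(x)\big]}_{\text{variance}}, \\
  \operatorname{Var}\big[\hat f(x)\big]
    &= \sum_{\emptyset \neq S \subseteq \{D,\, I,\, N\}} V_S .
\end{align}
```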

References

Showing 1–10 of 28 references
The Power of Interpolation: Understanding the Effectiveness of SGD in Modern Over-parametrized Learning
TLDR
The key observation is that most modern learning architectures are over-parametrized and are trained to interpolate the data by driving the empirical loss close to zero; it is still unclear why these interpolated solutions perform well on test data.
Overfitting or perfect fitting? Risk bounds for classification and regression rules that interpolate
TLDR
A step toward a theoretical foundation for interpolated classifiers is taken by analyzing local interpolating schemes, including a geometric simplicial interpolation algorithm and singularly weighted k-nearest-neighbor schemes, and consistency or near-consistency is proved for these schemes in classification and regression problems.
To understand deep learning we need to understand kernel learning
TLDR
It is argued that progress on understanding deep learning will be difficult until more tractable "shallow" kernel methods are better understood, and a need for new theoretical ideas for understanding the properties of classical kernel methods is identified.
Reconciling modern machine learning and the bias-variance trade-off
TLDR
A new "double descent" risk curve is exhibited that extends the traditional U-shaped bias-variance curve beyond the point of interpolation and shows that the risk of suitably chosen interpolating predictors from these models can, in fact, be decreasing as the model complexity increases, often below the risk achieved using non-interpolating models. Expand
Fast Convergence for Stochastic and Distributed Gradient Descent in the Interpolation Limit
  • P. Mitra
  • Mathematics, Computer Science
  • 2018 26th European Signal Processing Conference (EUSIPCO)
  • 2018
TLDR
In contrast with previous usage of similar penalty functions to enforce consensus between nodes, in the interpolating limit it is not required to take the penalty parameter to infinity for consensus to occur, which reinforces the utility of the interpolation limit in the theoretical treatment of modern machine learning algorithms.
Understanding deep learning requires rethinking generalization
TLDR
These experiments establish that state-of-the-art convolutional networks for image classification trained with stochastic gradient methods easily fit a random labeling of the training data, and confirm that simple depth-two neural networks already have perfect finite-sample expressivity.
Reconciling modern machine learning practice and the bias-variance trade-off
TLDR
This paper reconciles the classical understanding and the modern practice within a unified performance curve that subsumes the textbook U-shaped bias-variance trade-off curve by showing how increasing model capacity beyond the point of interpolation results in improved performance.
The jamming transition as a paradigm to understand the loss landscape of deep neural networks
TLDR
It is argued that in fully connected deep networks a phase transition delimits the over- and underparametrized regimes where fitting can or cannot be achieved, and it is observed that the ability of fully connected networks to fit random data is independent of their depth, an independence that appears to also hold for real data.
Just Interpolate: Kernel "Ridgeless" Regression Can Generalize
TLDR
This work isolates a phenomenon of implicit regularization for minimum-norm interpolated solutions which is due to a combination of high dimensionality of the input data, curvature of the kernel function, and favorable geometric properties of the data such as an eigenvalue decay of the empirical covariance and kernel matrices.
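As a minimal illustration of what "ridgeless" means in the summary above (with a Gaussian kernel and toy data as assumptions, not the cited paper's setting), kernel regression with the regularization term dropped solves the kernel system exactly and therefore interpolates the training labels while still producing smooth predictions elsewhere:

```python
# Minimal sketch of kernel "ridgeless" regression: the ridge term is dropped
# and the kernel system is solved exactly, so the fit interpolates the labels.
# Kernel choice, bandwidth, and data are illustrative assumptions.
import numpy as np

def rbf_kernel(A, B, gamma=1.0):
    # Gaussian (RBF) kernel matrix between the rows of A and the rows of B.
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * sq)

rng = np.random.default_rng(2)
X = rng.uniform(-1, 1, size=(30, 2))
y = np.sin(3 * X[:, 0]) * X[:, 1] + 0.1 * rng.normal(size=30)

K = rbf_kernel(X, X)
alpha = np.linalg.solve(K, y)            # interpolating coefficients (no ridge penalty)
X_new = rng.uniform(-1, 1, size=(5, 2))
y_hat = rbf_kernel(X_new, X) @ alpha     # predictions at new points

print(np.allclose(K @ alpha, y))         # True: the training labels are fit exactly
print(y_hat)
```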
High-dimensional dynamics of generalization error in neural networks
TLDR
It is found that the dynamics of gradient descent learning naturally protect against overtraining and overfitting in large networks, and that standard applications of theories such as Rademacher complexity are inaccurate in predicting the generalization performance of deep neural networks.