Benign, Tempered, or Catastrophic: A Taxonomy of Overfitting

@article{Mallinar2022BenignTO,
  title={Benign, Tempered, or Catastrophic: A Taxonomy of Overfitting},
  author={Neil Rohit Mallinar and James B. Simon and Amirhesam Abedsoltan and Parthe Pandit and Mikhail Belkin and Preetum Nakkiran},
  journal={ArXiv},
  year={2022},
  volume={abs/2207.06569}
}
The practical success of overparameterized neural networks has motivated the recent scientific study of interpolating methods, which perfectly fit their training data. Certain interpolating methods, including neural networks, can fit noisy training data without catastrophically bad test performance, in defiance of standard intuitions from statistical learning theory. Aiming to explain this, a body of recent work has studied benign overfitting, a phenomenon where some interpolating methods…
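
To make "interpolating method" concrete, the following is a minimal numpy sketch (not taken from the paper; the dimensions, noise level, and Gaussian data are arbitrary assumptions): the minimum ℓ2-norm linear interpolant in an overparameterized regression problem fits its noisy training labels exactly, yet its test error stays bounded rather than blowing up.

import numpy as np

rng = np.random.default_rng(0)
n, d, noise = 20, 200, 0.5                      # overparameterized: d >> n (arbitrary choices)
w_true = np.zeros(d)
w_true[0] = 1.0                                 # target depends on a single coordinate
X = rng.normal(size=(n, d))
y = X @ w_true + noise * rng.normal(size=n)     # noisy training labels

w_hat = np.linalg.pinv(X) @ y                   # minimum ell_2-norm interpolant: w = X^+ y

X_test = rng.normal(size=(1000, d))
y_test = X_test @ w_true                        # clean test targets
print("train MSE:", np.mean((X @ w_hat - y) ** 2))            # ~0: the predictor interpolates
print("test MSE:", np.mean((X_test @ w_hat - y_test) ** 2))   # finite, not catastrophic

Which of the three regimes in the title such an interpolant falls into depends on how n, d, and the data spectrum scale; the paper's taxonomy classifies exactly this kind of behaviour.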

The Final Ascent: When Bigger Models Generalize Worse on Noisy-Labeled Data

This work shows that under a sufficiently large noise-to-sample-size ratio, generalization error eventually increases with model size, and empirically observes that the adverse effect of network size is more pronounced when robust training methods are employed to learn from noisy-labeled data.

Deep Linear Networks can Benignly Overfit when Shallow Ones Do

It is shown that randomly initialized deep linear networks can closely approximate or even match known bounds for the minimum ℓ2-norm interpolant, and it is revealed that interpolating deep linear models have exactly the same conditional variance as the minimum ℓ2-norm solution.

The Eigenlearning Framework: A Conservation Law Perspective on Kernel Regression and Wide Neural Networks

A simple unified framework giving closed-form estimates for the test risk and other generalization metrics of kernel ridge regression is derived, enabled by the identification of a sharp conservation law which limits the ability of KRR to learn any orthonormal basis of functions.
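
For orientation, the estimator that this framework analyzes is ordinary kernel ridge regression, whose predictor is f(x) = k(x, X)(K + λI)^{-1} y. Below is a minimal numpy sketch of that predictor only (the RBF kernel, bandwidth, ridge value, and toy 1-D data are illustrative assumptions; the eigenlearning risk estimates themselves are not implemented here).

import numpy as np

def rbf_kernel(A, B, bandwidth=1.0):
    # Gaussian (RBF) kernel matrix between the rows of A and the rows of B.
    sq_dists = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq_dists / (2 * bandwidth ** 2))

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(40, 1))                  # toy 1-D inputs (arbitrary)
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=40)       # noisy targets

lam = 1e-3                                            # ridge parameter; lam -> 0 gives kernel interpolation
K = rbf_kernel(X, X)
alpha = np.linalg.solve(K + lam * np.eye(len(X)), y)  # alpha = (K + lam I)^{-1} y

X_test = np.linspace(-3, 3, 200)[:, None]
f_test = rbf_kernel(X_test, X) @ alpha                # KRR predictions k(x, X) alpha
print("test MSE vs. sin(x):", np.mean((f_test - np.sin(X_test[:, 0])) ** 2))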

Learning from few examples with nonlinear feature maps

This work considers the problem of data classification where the training set consists of just a few data points and reveals key relationships between the geometry of an AI model’s feature space, the structure of the underlying data distributions, and the model's generalisation capabilities.

Generalizing with overly complex representations

Representations enable cognitive systems to generalize from known experiences to new ones. Simplicity of a representation has been linked to its generalization ability. Conventionally, simple…

References

Showing 1-10 of 68 references

Benign overfitting in linear regression

A characterization of linear regression problems for which the minimum norm interpolating prediction rule has near-optimal prediction accuracy shows that overparameterization is essential for benign overfitting in this setting: the number of directions in parameter space that are unimportant for prediction must significantly exceed the sample size.
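
For reference, the minimum-norm interpolating prediction rule discussed here (and in several entries below) is the least ℓ2-norm solution of the underdetermined system Xθ = y; this is standard background rather than a result of this particular paper:

\hat{\theta} \;=\; \operatorname*{arg\,min}_{\theta \in \mathbb{R}^d} \|\theta\|_2
\quad \text{subject to } X\theta = y,
\qquad\text{with closed form}\quad
\hat{\theta} = X^\top (X X^\top)^{-1} y
\quad \text{when } d > n \text{ and } XX^\top \text{ is invertible.}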

Deep learning: a statistical viewpoint

This article surveys recent progress in statistical learning theory that provides examples illustrating these principles in simpler settings, and focuses specifically on the linear regime for neural networks, where the network can be approximated by a linear model.

Benign Overfitting without Linearity: Neural Network Classifiers Trained by Gradient Descent for Noisy Linear Data

This work considers the generalization error of two-layer neural networks trained to interpolation by gradient descent on the logistic loss following random initialization and shows that in this setting, neural networks exhibit benign overfitting: they can be driven to zero training error, perfectly matching any noisy training labels, and simultaneously achieve minimax optimal test error.

Overfitting or perfect fitting? Risk bounds for classification and regression rules that interpolate

A step toward a theoretical foundation for interpolated classifiers is taken by analyzing local interpolating schemes, including a geometric simplicial interpolation algorithm and singularly weighted k-nearest-neighbor schemes, and consistency or near-consistency is proved for these schemes in classification and regression problems.

To understand deep learning we need to understand kernel learning

It is argued that progress on understanding deep learning will be difficult until more tractable "shallow" kernel methods are better understood, and that new theoretical ideas are needed to understand the properties of classical kernel methods.

The Interplay Between Implicit Bias and Benign Overfitting in Two-Layer Linear Networks

This work studies interpolating two-layer linear neural networks trained with gradient flow on the squared loss and derives bounds on the excess risk when the covariates satisfy sub-Gaussianity and anti-concentration properties, and the noise is independent and sub-Gaussian.

Understanding deep learning (still) requires rethinking generalization

These experiments establish that state-of-the-art convolutional networks for image classification trained with stochastic gradient methods easily fit a random labeling of the training data, and corroborate these experimental findings with a theoretical construction showing that simple depth two neural networks already have perfect finite sample expressivity.
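
A scaled-down version of this random-label experiment is easy to reproduce (a sketch under stated assumptions: a small fully connected network on synthetic Gaussian data stands in for the convolutional networks and image benchmarks used in the paper).

import numpy as np
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 20))          # synthetic inputs (arbitrary size)
y = rng.integers(0, 2, size=500)        # labels assigned completely at random

# An overparameterized MLP; with enough capacity and training iterations it
# typically memorizes the random labeling (training accuracy near 1.0).
clf = MLPClassifier(hidden_layer_sizes=(1024,), alpha=0.0, max_iter=5000,
                    tol=1e-7, random_state=0)
clf.fit(X, y)
print("training accuracy on random labels:", clf.score(X, y))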

Harmless interpolation of noisy data in regression

A bound on how well such interpolative solutions can generalize to fresh test data is given, and it is shown that this bound generically decays to zero with the number of extra features, thus characterizing an explicit benefit of overparameterization.

Understanding deep learning requires rethinking generalization

These experiments establish that state-of-the-art convolutional networks for image classification trained with stochastic gradient methods easily fit a random labeling of the training data, and confirm that simple depth two neural networks already have perfect finite sample expressivity.

Reconciling modern machine-learning practice and the classical bias–variance trade-off

This work shows how classical theory and modern practice can be reconciled within a single unified performance curve and proposes a mechanism underlying its emergence, and provides evidence for the existence and ubiquity of double descent for a wide spectrum of models and datasets.
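
The double-descent curve itself can be reproduced in a toy setting (a sketch with arbitrary choices: random ReLU features and minimum-norm least squares on synthetic data, not the models or datasets studied in the paper); test error typically peaks as the number of features approaches the number of samples and falls again beyond that interpolation threshold.

import numpy as np

rng = np.random.default_rng(0)
n, d, noise = 100, 10, 0.1
X = rng.normal(size=(n, d))
y = np.sin(X[:, 0]) + noise * rng.normal(size=n)     # noisy scalar targets
X_te = rng.normal(size=(2000, d))
y_te = np.sin(X_te[:, 0])

for p in [10, 50, 90, 100, 110, 200, 1000]:          # number of random ReLU features
    W = rng.normal(size=(d, p))                      # fixed random first layer
    F, F_te = np.maximum(X @ W, 0), np.maximum(X_te @ W, 0)
    beta = np.linalg.pinv(F) @ y                     # (min-norm) least-squares fit on the features
    mse = np.mean((F_te @ beta - y_te) ** 2)
    print(f"p={p:5d}  test MSE={mse:.3f}")           # typically spikes near p = n, then falls
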
...