Corpus ID: 211505957

Rethinking Bias-Variance Trade-off for Generalization of Neural Networks

@article{Yang2020RethinkingBT,
  title={Rethinking Bias-Variance Trade-off for Generalization of Neural Networks},
  author={Zitong Yang and Yaodong Yu and Chong You and Jacob Steinhardt and Yi Ma},
  journal={ArXiv},
  year={2020},
  volume={abs/2002.11328}
}
The classical bias-variance trade-off predicts that bias decreases and variance increases with model complexity, leading to a U-shaped risk curve. Recent work calls this into question for neural networks and other over-parameterized models, for which it is often observed that larger models generalize better. We provide a simple explanation for this by measuring the bias and variance of neural networks: while the bias is monotonically decreasing as in the classical theory, the variance is… 
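For reference, the U-shaped risk curve invoked above comes from the standard decomposition of expected squared-error risk. The display below states it in generic notation (target f, random training set D, learned predictor trained on D); this is the textbook form, not necessarily the exact estimators defined in this paper.

```latex
% Classical bias-variance decomposition for squared loss (generic notation).
% f: target function, \mathcal{D}: random training set, \hat{f}_{\mathcal{D}}: model trained on \mathcal{D}.
\mathbb{E}_{\mathcal{D}}\!\left[\big(f(x) - \hat{f}_{\mathcal{D}}(x)\big)^2\right]
  = \underbrace{\big(f(x) - \bar{f}(x)\big)^2}_{\text{Bias}^2}
  + \underbrace{\mathbb{E}_{\mathcal{D}}\!\left[\big(\hat{f}_{\mathcal{D}}(x) - \bar{f}(x)\big)^2\right]}_{\text{Variance}},
\qquad \text{where } \bar{f}(x) := \mathbb{E}_{\mathcal{D}}\big[\hat{f}_{\mathcal{D}}(x)\big].
```

Under this decomposition, the classical U-shape arises when the squared-bias term falls and the variance term rises with model complexity; since the abstract reports that the bias still decreases monotonically, any departure from the U-shape must come from the behavior of the variance term.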
Memorizing without overfitting: Bias, variance, and interpolation in over-parameterized models
TLDR
Methods from statistical physics are used to derive analytic expressions for bias and variance in three minimal models for over-parameterization (linear regression and two-layer neural networks with linear and nonlinear activation functions), allowing us to disentangle properties stemming from the model architecture and random sampling of data.
Double Trouble in Double Descent : Bias and Variance(s) in the Lazy Regime
TLDR
A quantitative theory for the double descent of test error in the so-called lazy learning regime of neural networks is developed by considering the problem of learning a high-dimensional function with random features regression, and it is shown that the bias displays a phase transition at the interpolation threshold, beyond which it remains constant.
Understanding Double Descent Requires a Fine-Grained Bias-Variance Decomposition
TLDR
This work describes an interpretable, symmetric decomposition of the variance into terms associated with the randomness from sampling, initialization, and the labels, computes the high-dimensional asymptotic behavior of this decomposition for random-feature kernel regression, and analyzes the strikingly rich phenomenology that arises (a generic Monte Carlo sketch in this spirit appears after this list).
Early Stopping in Deep Networks: Double Descent and How to Eliminate it
TLDR
Inspired by this theory, two standard convolutional networks are studied empirically, and it is shown that eliminating epoch-wise double descent by adjusting the step sizes of different layers significantly improves early-stopping performance.
What causes the test error? Going beyond bias-variance via ANOVA
TLDR
The analysis of variance (ANOVA) is used to decompose the variance in the test error in a symmetric way, in order to study the generalization performance of certain two-layer linear and non-linear networks, and advanced deterministic-equivalent techniques for Haar random matrices are proposed.
Revisiting complexity and the bias-variance tradeoff
TLDR
This work focuses on the context of linear models, which have recently been used as a stylized tractable approximation to DNNs in high dimensions, and proposes a novel MDL-based complexity (MDL-COMP), defined via an optimality criterion over the encodings induced by a good Ridge estimator class.
Dimensionality reduction, regularization, and generalization in overparameterized regressions
TLDR
It is shown that, unlike in the underparameterized regime, OLS is arbitrarily susceptible to data-poisoning attacks in the overparameterized regime, and that regularization and dimensionality reduction improve robustness.
An Investigation of Why Overparameterization Exacerbates Spurious Correlations
TLDR
The analysis leads to a counterintuitive approach of subsampling the majority group, which empirically achieves low minority error in the overparameterized regime, even though the standard approach of upweighting the minority fails.
Kernel regression in high dimension: Refined analysis beyond double descent
TLDR
This refined analysis goes beyond the double descent theory by showing that, depending on the data eigen-profile and the level of regularization, the kernel regression risk curve can be a double-descent-like, bell-shaped, or monotonic function of the sample size n.
Rethink the Connections among Generalization, Memorization and the Spectral Bias of DNNs
TLDR
It is shown that, under the experimental setup of deep double descent, the high-frequency components of DNNs begin to diminish in the second descent while examples with random labels are still being memorized, and that the spectrum of DNNs can be used to monitor test behavior.
...
...
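Several of the works listed above estimate bias and variance terms of this kind empirically, by retraining the model over many independent draws of the training data and of the random initialization. The sketch below is a minimal, self-contained NumPy illustration of such a Monte Carlo estimate for squared loss; the toy target, the random-features ridge learner, and the trial counts are illustrative assumptions and do not reproduce the protocol of any specific paper on this page.

```python
# Monte Carlo sketch of a bias/variance estimate for squared loss.
# Illustrative assumptions: a toy 1-D target, a random-features ridge regressor
# as the learner, and averaging over independently drawn training sets and
# feature initializations.
import numpy as np

rng = np.random.default_rng(0)

def target(x):
    return np.sin(2 * np.pi * x)

def train_and_predict(x_train, y_train, x_test, seed, n_features=50, lam=1e-3):
    """Hypothetical learner: ridge regression on random cosine features."""
    r = np.random.default_rng(seed)
    w = r.normal(scale=3.0, size=n_features)
    b = r.uniform(0, 2 * np.pi, size=n_features)
    phi = lambda x: np.cos(np.outer(x, w) + b)              # random feature map
    Phi = phi(x_train)
    theta = np.linalg.solve(Phi.T @ Phi + lam * np.eye(n_features), Phi.T @ y_train)
    return phi(x_test) @ theta

x_test = np.linspace(0.0, 1.0, 200)
preds = []
for trial in range(100):                                    # fresh data and seed per trial
    x_tr = rng.uniform(0.0, 1.0, size=30)
    y_tr = target(x_tr) + rng.normal(scale=0.1, size=30)    # noisy labels
    preds.append(train_and_predict(x_tr, y_tr, x_test, seed=trial))
preds = np.stack(preds)                                      # shape: (trials, test points)

mean_pred = preds.mean(axis=0)                               # empirical "average model"
bias_sq = np.mean((mean_pred - target(x_test)) ** 2)         # squared bias, averaged over x
variance = np.mean(preds.var(axis=0))                        # variance, averaged over x
print(f"bias^2 ~ {bias_sq:.4f}, variance ~ {variance:.4f}")
```

Averaging the predictions over trials plays the role of the mean predictor in the decomposition above; finer-grained decompositions such as those in the papers above typically split the variance further by holding one source of randomness (data sampling, initialization, or label noise) fixed while averaging over the others, with the exact definitions varying from paper to paper.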

References

SHOWING 1-10 OF 41 REFERENCES
A Modern Take on the Bias-Variance Tradeoff in Neural Networks
TLDR
It is found that both bias and variance can decrease as the number of parameters grows, and a new decomposition of the variance is introduced to disentangle the effects of optimization and data sampling.
High-dimensional dynamics of generalization error in neural networks
Generalization of Two-layer Neural Networks: An Asymptotic Viewpoint
TLDR
The exact population risk is derived for the unregularized least-squares regression problem with two-layer neural networks, when either the first or the second layer is trained using a gradient flow, under different initialization setups.
The Generalization Error of Random Features Regression: Precise Asymptotics and the Double Descent Curve
TLDR
Deep learning methods operate in regimes that defy the traditional statistical mindset: the models are rich enough to interpolate the observed labels, even when the latter are replaced by pure noise.
From Fixed-X to Random-X Regression: Bias-Variance Decompositions, Covariance Penalties, and Prediction Error Estimation
ABSTRACT In statistical prediction, classical approaches for model selection and model evaluation based on covariance penalties are still widely used. Most of the literature on this topic is based on…
More Data Can Hurt for Linear Regression: Sample-wise Double Descent
TLDR
A surprising phenomenon in overparameterized linear regression, where the dimension exceeds the number of samples, is described: there is a regime where the test risk of the estimator found by gradient descent increases with additional samples, due to an unconventional type of bias-variance tradeoff in the overparameterized regime.
Benign overfitting in linear regression
TLDR
A characterization of linear regression problems for which the minimum norm interpolating prediction rule has near-optimal prediction accuracy shows that overparameterization is essential for benign overfitting in this setting: the number of directions in parameter space that are unimportant for prediction must significantly exceed the sample size.
Reconciling modern machine-learning practice and the classical bias–variance trade-off
TLDR
This work shows how classical theory and modern practice can be reconciled within a single unified performance curve, proposes a mechanism underlying its emergence, and provides evidence for the existence and ubiquity of double descent for a wide spectrum of models and datasets.
A Model of Double Descent for High-dimensional Binary Linear Classification
TLDR
A model for logistic regression where only a subset of features of size p is used for training a linear classifier over n training samples is considered, and a phase-transition phenomenon for the case of Gaussian features is uncovered.
Understanding deep learning requires rethinking generalization
TLDR
These experiments establish that state-of-the-art convolutional networks for image classification trained with stochastic gradient methods easily fit a random labeling of the training data, and confirm that simple depth-two neural networks already have perfect finite-sample expressivity.
...
...