• Corpus ID: 237304082

The Interplay Between Implicit Bias and Benign Overfitting in Two-Layer Linear Networks

@article{Chatterji2021TheIB,
  title={The Interplay Between Implicit Bias and Benign Overfitting in Two-Layer Linear Networks},
  author={Niladri S. Chatterji and Philip M. Long and Peter L. Bartlett},
  journal={ArXiv},
  year={2021},
  volume={abs/2108.11489}
}
• Published 25 August 2021
• Computer Science
• ArXiv
The recent success of neural network models has shone light on a rather surprising statistical phenomenon: statistical models that perfectly fit noisy data can generalize well to unseen test data. Understanding this phenomenon of benign overfitting has attracted intense theoretical and empirical study. In this paper, we consider interpolating two-layer linear neural networks trained with gradient flow on the squared loss and derive bounds on the excess risk when the covariates satisfy sub…
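As a concrete illustration of the setting described in the abstract (not the paper's actual construction or proof), the following numpy sketch trains a small two-layer linear network by full-batch gradient descent on the squared loss, a discretization of gradient flow, on an overparameterized problem until it nearly interpolates noisy labels. All dimensions, initialization scales, step sizes, and variable names are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, m = 20, 100, 50            # n samples, d features (d > n), m hidden units
X = rng.normal(size=(n, d))
# Noisy labels from a random linear teacher (illustrative, not the paper's setup)
y = X @ rng.normal(size=d) / np.sqrt(d) + 0.1 * rng.normal(size=n)

# Two-layer linear network f(x) = v^T W x, trained by full-batch gradient
# descent on the squared loss (a discretization of gradient flow).
W = 0.05 * rng.normal(size=(m, d))
v = 0.05 * rng.normal(size=m)
lr = 0.02
for _ in range(20000):
    resid = (X @ W.T) @ v - y            # residuals, shape (n,)
    grad_v = (X @ W.T).T @ resid / n     # d/dv of (1/2n)||resid||^2
    grad_W = np.outer(v, resid @ X) / n  # d/dW of (1/2n)||resid||^2
    v -= lr * grad_v
    W -= lr * grad_W

train_loss = np.mean(((X @ W.T) @ v - y) ** 2)  # driven toward 0: interpolation
```

Because d > n, the effective linear predictor W^T v has enough capacity to fit the noisy labels exactly; which interpolant gradient flow selects is precisely the implicit-bias question the paper studies.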
6 Citations
• Computer Science
COLT
• 2022
This work considers the generalization error of two-layer neural networks trained to interpolation by gradient descent on the logistic loss following random initialization and shows that in this setting, neural networks exhibit benign overfitting: they can be driven to zero training error, perfectly matching any noisy training labels, and simultaneously achieve minimax optimal test error.
• Computer Science
ArXiv
• 2022
It is argued that many real interpolating methods like neural networks do not fit benignly: modest noise in the training set causes nonzero (but non-infinite) excess risk at test time, implying these models are neither benign nor catastrophic but rather fall in an intermediate regime.
• Computer Science
ArXiv
• 2022
It is established that, by adding well-defined layers to an underparameterized DCNN, one can obtain some interpolating DCNNs that maintain the good learning rates of the underparameterized DCNN.
• Computer Science
ArXiv
• 2022
Using a general converse Lyapunov-like theorem, a unified analysis of GD/SGD is provided, not only for classical settings such as convex losses or objectives satisfying PL/KL properties, but also for more complex problems including phase retrieval and matrix square-root.
• Computer Science
ArXiv
• 2022
It is shown that randomly initialized deep linear networks can closely approximate or even match known bounds for the minimum ℓ2-norm interpolant, and it is revealed that interpolating deep linear models have exactly the same conditional variance as the minimum ℓ2-norm solution.
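The minimum ℓ2-norm interpolant referenced above has a simple closed form in the overparameterized regime. A minimal numpy sketch (dimensions, seed, and variable names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 20, 100                       # overparameterized: many more features than samples
X = rng.normal(size=(n, d))
y = rng.normal(size=n)               # arbitrary (even pure-noise) labels

# Minimum l2-norm interpolant: beta = X^+ y = X^T (X X^T)^{-1} y.
# It fits the training data exactly while having the smallest l2 norm
# among all interpolating solutions.
beta = np.linalg.pinv(X) @ y

# np.linalg.lstsq returns the same minimum-norm solution for underdetermined systems
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
```

Either route works; the pseudoinverse form makes the projection onto the row space of X explicit, while lstsq is the numerically preferred routine in practice.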
• Computer Science
• 2022
This work shows that under a sufficiently large noise-to-sample size ratio, generalization error eventually increases with model size, and empirically observes that the adverse effect of network size is more pronounced when robust training methods are employed to learn from noisy-labeled data.

References

Showing 1–10 of 48 references

• Computer Science
ArXiv
• 2021
It is shown that the two-layer ReLU network interpolator can achieve a near minimax-optimal learning rate, which, to the authors' knowledge, is the first generalization result for such networks.
• Computer Science
ICML
• 2020
This work measures the bias and variance of neural networks and finds that deeper models decrease bias and increase variance for both in-distribution and out-of-distribution data, and corroborates these empirical results with a theoretical analysis of two-layer linear networks with random first layer.
• Computer Science
• 2020
This work provides non-asymptotic generalization bounds for overparametrized ridge regression that depend on the arbitrary covariance structure of the data, and shows that those bounds are tight for a range of regularization parameter values.
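The overparameterized ridge estimator discussed above can be computed in either its primal (d x d) or kernel/dual (n x n) form; the two agree by the standard push-through identity. A minimal numpy sketch with illustrative dimensions and regularization strength:

```python
import numpy as np

rng = np.random.default_rng(3)
n, d, lam = 20, 100, 0.1
X = rng.normal(size=(n, d))
y = rng.normal(size=n)

# Primal form: (X^T X + lam I_d)^{-1} X^T y      (d x d solve)
beta_primal = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

# Kernel/dual form: X^T (X X^T + lam I_n)^{-1} y (n x n solve, cheaper when d >> n)
beta_dual = X.T @ np.linalg.solve(X @ X.T + lam * np.eye(n), y)
```

The dual form is the one that matters in the overparameterized regime (d much larger than n), and as lam tends to 0 it recovers the minimum ℓ2-norm interpolant.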
• Computer Science
2019 IEEE International Symposium on Information Theory (ISIT)
• 2019
A bound on how well such interpolative solutions can generalize to fresh test data is given, and it is shown that this bound generically decays to zero with the number of extra features, thus characterizing an explicit benefit of overparameterization.
• Computer Science
ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
• 2021
This paper studies benign overfitting for data generated from a popular binary Gaussian mixture model (GMM) and classifiers trained by support-vector machines (SVM) to derive novel non-asymptotic bounds on the classification error of the least-squares (LS) solution.
• Computer Science
ICLR
• 2017
These experiments establish that state-of-the-art convolutional networks for image classification trained with stochastic gradient methods easily fit a random labeling of the training data, and confirm that simple depth two neural networks already have perfect finite sample expressivity.
• Computer Science
COLT
• 2020
This work shows how the scale of the initialization controls the transition between the "kernel" and "rich" regimes and affects generalization properties in multilayer homogeneous models and highlights an interesting role for the width of a model in the case that the predictor is not identically zero at initialization.
• Computer Science
J. Mach. Learn. Res.
• 2018
We examine gradient descent on unregularized logistic regression problems, with homogeneous linear predictors on linearly separable datasets. We show the predictor converges to the direction of the max-margin (hard margin SVM) solution.
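That convergence-in-direction behavior can be observed numerically: on separable data, gradient descent on the logistic loss drives the loss toward zero while the norm of the iterate diverges and its direction settles on one that classifies every point with positive margin. A small numpy sketch (the data construction, step size, and iteration count are all illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 40
w_star = np.array([1.0, 0.0])            # illustrative separating direction
X = rng.normal(size=(n, 2))
y = np.sign(X @ w_star)
X += 0.5 * y[:, None] * w_star           # shift points to enforce a positive margin

w = np.zeros(2)
lr = 0.5
for _ in range(20000):
    margins = y * (X @ w)
    # gradient of the mean logistic loss log(1 + exp(-margin))
    g = -(X * (y / (1.0 + np.exp(margins)))[:, None]).mean(axis=0)
    w -= lr * g

direction = w / np.linalg.norm(w)        # ||w|| diverges; the direction converges
min_margin = (y * (X @ direction)).min() # positive once every point is classified correctly
```

Tracking `direction` across iterations (rather than `w` itself) is the right lens here, since the unnormalized iterate never converges on separable data.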
• Computer Science
Acta Numerica
• 2021
This article surveys recent progress in statistical learning theory that provides examples illustrating these principles in simpler settings, and focuses specifically on the linear regime for neural networks, where the network can be approximated by a linear model.
• Computer Science, Mathematics
ICLR
• 2021
The implicit bias of gradient flow is studied on linear neural network training, and it is proved that gradient flow finds a global minimum which minimizes a norm-like function that interpolates between weighted $\ell_1$ and $\ell_2$ norms in the transformed input space.