• Corpus ID: 237304082

The Interplay Between Implicit Bias and Benign Overfitting in Two-Layer Linear Networks

  title={The Interplay Between Implicit Bias and Benign Overfitting in Two-Layer Linear Networks},
  author={Niladri S. Chatterji and Philip M. Long and Peter L. Bartlett},
The recent success of neural network models has shone light on a rather surprising statistical phenomenon: statistical models that perfectly fit noisy data can generalize well to unseen test data. Understanding this phenomenon of benign overfitting has attracted intense theoretical and empirical study. In this paper, we consider interpolating two-layer linear neural networks trained with gradient flow on the squared loss and derive bounds on the excess risk when the covariates satisfy sub… 

Benign Overfitting without Linearity: Neural Network Classifiers Trained by Gradient Descent for Noisy Linear Data

This work considers the generalization error of two-layer neural networks trained to interpolation by gradient descent on the logistic loss following random initialization and shows that in this setting, neural networks exhibit benign overfitting: they can be driven to zero training error, perfectly matching any noisy training labels, and simultaneously achieve minimax optimal test error.

Benign, Tempered, or Catastrophic: A Taxonomy of Overfitting

It is argued that many real interpolating methods like neural networks do not fit benignly : modest noise in the training set causes nonzero (but non-infinite) excess risk at test time, implying these models are neither benign nor catastrophic but rather fall in an intermediate regime.

Learning Ability of Interpolating Convolutional Neural Networks

It is established that, by adding well defined layers to an underparameterized DCNN, one can obtain some interpolating DCNNs that maintain the good learning rates of the underparametersize DCNN.

From Gradient Flow on Population Loss to Learning with Stochastic Gradient Descent

Using a general converse Lyapunov like theorem, a unified analysis for GD/SGD is provided not only for classical settings like convex losses, or objectives that satisfy PL/ KL properties, but also for more complex problems including Phase Retrieval and Matrix sq-root.

Deep Linear Networks can Benignly Overfit when Shallow Ones Do

It is shown that randomly initialized deep linear networks can closely approximate or even match known bounds for the minimum 𝓁 2 -norm interpolant, and it is revealed that interpolating deep linear models have exactly the same conditional variance as the minimum -norm solution.

The Final Ascent: When Bigger Models Generalize Worse on Noisy-Labeled Data

This work shows that under a sufficiently large noise-to-sample size ratio, generalization error eventually increases with model size, and empirically observes that the adverse effect of network size is more pronounced when robust training methods are employed to learn from noisy-labeled data.



Towards an Understanding of Benign Overfitting in Neural Networks

It is shown that it is possible for the two-layer ReLU network interpolator to achieve a near minimax-optimal learning rate, which to this knowledge is the first generalization result for such networks.

Rethinking Bias-Variance Trade-off for Generalization of Neural Networks

This work measures the bias and variance of neural networks and finds that deeper models decrease bias and increase variance for both in-dist distribution and out-of-distribution data, and corroborates these empirical results with a theoretical analysis of two-layer linear networks with random first layer.

Benign overfitting in ridge regression

This work provides non-asymptotic generalization bounds for overparametrized ridge regression that depend on the arbitrary covariance structure of the data, and shows that those bounds are tight for a range of regularization parameter values.

Harmless interpolation of noisy data in regression

A bound on how well such interpolative solutions can generalize to fresh test data is given, and it is shown that this bound generically decays to zero with the number of extra features, thus characterizing an explicit benefit of overparameterization.

Benign Overfitting in Binary Classification of Gaussian Mixtures

  • Ke WangChristos Thrampoulidis
  • Computer Science
    ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
  • 2021
This paper studies benign-overfitting for data generated from a popular binary Gaussian mixtures model (GMM) and classifiers trained by support-vector machines (SVM) to derive novel non-asymptotic bounds on the classification error of LS solution.

Understanding deep learning requires rethinking generalization

These experiments establish that state-of-the-art convolutional networks for image classification trained with stochastic gradient methods easily fit a random labeling of the training data, and confirm that simple depth two neural networks already have perfect finite sample expressivity.

Kernel and Rich Regimes in Overparametrized Models

This work shows how the scale of the initialization controls the transition between the "kernel" and "rich" regimes and affects generalization properties in multilayer homogeneous models and highlights an interesting role for the width of a model in the case that the predictor is not identically zero at initialization.

The Implicit Bias of Gradient Descent on Separable Data

We examine gradient descent on unregularized logistic regression problems, with homogeneous linear predictors on linearly separable datasets. We show the predictor converges to the direction of the

Deep learning: a statistical viewpoint

This article surveys recent progress in statistical learning theory that provides examples illustrating these principles in simpler settings, and focuses specifically on the linear regime for neural networks, where the network can be approximated by a linear model.

A Unifying View on Implicit Bias in Training Linear Neural Networks

The implicit bias of gradient flow is studied on linear neural network training, and it is proved that gradient flow finds a global minimum which minimizes a norm-like function that interpolates between weighted $\ell_1$ and $\ell-2$ norms in the transformed input space.