Understanding deep learning (still) requires rethinking generalization

@article{Zhang2021UnderstandingDL,
  title={Understanding deep learning (still) requires rethinking generalization},
  author={Chiyuan Zhang and Samy Bengio and Moritz Hardt and Benjamin Recht and Oriol Vinyals},
  journal={Communications of the ACM},
  year={2021},
  volume={64},
  pages={107--115}
}
Despite their massive size, successful deep artificial neural networks can exhibit a remarkably small gap between training and test performance. Conventional wisdom attributes small generalization error either to properties of the model family or to the regularization techniques used during training. Through extensive systematic experiments, we show how these traditional approaches fail to explain why large neural networks generalize well in practice. Specifically, our experiments establish… 
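
As a concrete illustration of the randomization test behind these experiments, the sketch below fits a small convolutional network to CIFAR-10 images whose labels have been replaced by uniformly random ones. The architecture, optimizer settings, and epoch count are illustrative assumptions rather than the paper's exact configuration; the phenomenon the experiments document is that training accuracy rises far above the 10% chance level and, given enough capacity and epochs, approaches 100% even though the labels carry no information.

# Randomization test sketch: train a CNN on CIFAR-10 with random labels.
# The small architecture and hyperparameters below are illustrative assumptions.
import torch
import torch.nn as nn
import torchvision
import torchvision.transforms as T

torch.manual_seed(0)
device = "cuda" if torch.cuda.is_available() else "cpu"

train_set = torchvision.datasets.CIFAR10(
    root="./data", train=True, download=True, transform=T.ToTensor()
)
# Replace every label with an independent uniform draw from the 10 classes.
train_set.targets = torch.randint(0, 10, (len(train_set),)).tolist()
loader = torch.utils.data.DataLoader(train_set, batch_size=128, shuffle=True)

# A small CNN; the paper's experiments use standard architectures (e.g. Inception variants).
model = nn.Sequential(
    nn.Conv2d(3, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
    nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
    nn.Flatten(),
    nn.Linear(128 * 8 * 8, 256), nn.ReLU(),
    nn.Linear(256, 10),
).to(device)
opt = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
loss_fn = nn.CrossEntropyLoss()

for epoch in range(50):                      # enough epochs to memorize
    correct, total = 0, 0
    for x, y in loader:
        x, y = x.to(device), y.to(device)
        opt.zero_grad()
        logits = model(x)
        loss_fn(logits, y).backward()
        opt.step()
        correct += (logits.argmax(1) == y).sum().item()
        total += y.numel()
    print(f"epoch {epoch}: train accuracy on random labels = {correct / total:.3f}")

Because the random labels carry no information about the images, any fit achieved here is pure memorization, and the test accuracy of the resulting model stays at roughly the 10% chance level.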

Citations of this paper

Contrasting random and learned features in deep Bayesian linear regression

Comparing deep random feature models to deep networks in which all layers are trained provides a detailed characterization of the interplay between width, depth, data density, and prior mismatch and begins to elucidate how architectural details affect generalization performance in this simple class of deep regression models.

Universal mean-field upper bound for the generalization gap of deep neural networks.

Results from replica mean field theory are employed to compute the generalization gap of machine learning models with quenched features, in the teacher-student scenario and for regression problems with a quadratic loss function.

Overfreezing Meets Overparameterization: A Double Descent Perspective on Transfer Learning of Deep Neural Networks

It is demonstrated that the number of frozen layers can determine whether transfer learning is effectively underparameterized or overparameterized, which in turn may affect the relative success or failure of learning.

Nonparametric Regression with Shallow Overparameterized Neural Networks Trained by GD with Early Stopping

This work explores the ability of overparameterized shallow neural networks trained by gradient descent to learn Lipschitz regression functions with and without label noise, and proposes an early stopping rule under which optimal rates are achieved.

Demystifying Batch Normalization in ReLU Networks: Equivalent Convex Optimization Models and Implicit Regularization

An analytic framework based on convex duality is introduced to obtain exact convex representations of weight-decay regularized ReLU networks with BN, which can be trained in polynomial time, and it is shown that optimal layer weights can be obtained as simple closed-form formulas in the high-dimensional and/or overparameterized regimes.

Robust Training under Label Noise by Over-parameterization

This work proposes a principled approach for robust training of over-parameterized deep networks in classification tasks where a proportion of training labels are corrupted, and demonstrates state-of-the-art test accuracy against label noise on a variety of real datasets.

The Equilibrium Hypothesis: Rethinking implicit regularization in Deep Neural Networks

The Equilibrium Hypothesis is introduced and empirically validated; it states that the layers that achieve some balance between forward and backward information loss are the ones with the highest alignment to data labels.

Gradient Descent Optimizes Infinite-Depth ReLU Implicit Networks with Linear Widths

It is proved that both gradient flow (GF) and gradient descent (GD) converge to a global minimum at a linear rate if the width m of the implicit network is linear in the sample size N, i.e., m = Ω(N).

Analytic Learning of Convolutional Neural Network For Pattern Recognition

Theoretically, it is shown that ACnnL yields a closed-form solution similar to its MLP counterpart but with different regularization constraints, which helps explain, from an implicit-regularization point of view, why CNNs usually generalize better than MLPs.

On the Convergence of Shallow Neural Network Training with Randomly Masked Neurons

This work proves a linear convergence rate of the training error, within an error region, for an overparameterized single-hidden-layer perceptron with ReLU activations on a regression task, and shows that, for a fixed neuron selection probability, the error term decreases as the number of surrogate models increases and increases with the number of local training steps.
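
As a rough schematic of the setup this summary describes (not the paper's exact algorithm), the sketch below trains a single-hidden-layer ReLU network for regression where, in each round, several surrogate models each update only a randomly selected subset of neurons for a few local gradient steps before their updates are averaged. The neuron selection probability p, the number of surrogate models S, and the number of local steps are the quantities the stated result depends on; all numeric values and the toy target are illustrative assumptions.

# Schematic of randomly-masked-neuron training for a shallow ReLU regressor.
import numpy as np

rng = np.random.default_rng(0)
N, d, m = 200, 10, 2000                 # samples, input dim, (overparameterized) width
p, S, local_steps, lr = 0.5, 4, 5, 0.5  # selection prob., surrogates, local steps, step size

X = rng.standard_normal((N, d)) / np.sqrt(d)
y = np.sin(X @ rng.standard_normal(d))  # toy regression target
W = rng.standard_normal((m, d))         # hidden-layer weights (trained)
a = rng.choice([-1.0, 1.0], size=m)     # output weights (kept fixed)

def predict(W, mask=None):
    h = np.maximum(X @ W.T, 0.0)        # ReLU features, shape (N, m)
    if mask is not None:
        h = h * mask                    # zero out the inactive neurons
    return h @ a / np.sqrt(m)

for rnd in range(100):
    update = np.zeros_like(W)
    for _ in range(S):                  # independent surrogate models
        mask = (rng.random(m) < p).astype(float)
        W_s = W.copy()
        for _ in range(local_steps):    # local gradient steps on the masked network
            r = predict(W_s, mask) - y
            active = (X @ W_s.T > 0).astype(float) * mask
            grad = (active * r[:, None] * a / np.sqrt(m)).T @ X / N
            W_s -= lr * grad
        update += (W_s - W) / S         # average the surrogate updates
    W += update
    if rnd % 20 == 0:
        print(rnd, float(np.mean((predict(W) - y) ** 2)))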
...

References

Showing 1-10 of 47 references

Understanding deep learning requires rethinking generalization

These experiments establish that state-of-the-art convolutional networks for image classification trained with stochastic gradient methods easily fit a random labeling of the training data, and confirm that simple depth two neural networks already have perfect finite sample expressivity.

A Bayesian Perspective on Generalization and Stochastic Gradient Descent

It is proposed that the noise introduced by small mini-batches drives the parameters towards minima whose evidence is large, and it is demonstrated that, when one holds the learning rate fixed, there is an optimum batch size which maximizes the test set accuracy.

A Closer Look at Memorization in Deep Networks

The analysis suggests that dataset-independent notions of effective capacity are unlikely to explain the generalization performance of deep networks trained with gradient-based methods, because the training data itself plays an important role in determining the degree of memorization.

Non-vacuous Generalization Bounds at the ImageNet Scale: a PAC-Bayesian Compression Approach

This paper provides the first non-vacuous generalization guarantees for realistic architectures applied to the ImageNet classification problem and establishes an absolute limit on expected compressibility as a function of expected generalization error.

Dropout: a simple way to prevent neural networks from overfitting

It is shown that dropout improves the performance of neural networks on supervised learning tasks in vision, speech recognition, document classification and computational biology, obtaining state-of-the-art results on many benchmark data sets.

Computing Nonvacuous Generalization Bounds for Deep (Stochastic) Neural Networks with Many More Parameters than Training Data

By optimizing the PAC-Bayes bound directly, the approach of Langford and Caruana (2001) is extended to obtain nonvacuous generalization bounds for deep stochastic neural network classifiers with millions of parameters, trained on only tens of thousands of examples.

Minimum norm solutions do not always generalize well for over-parameterized problems

It is empirically shown that the minimum norm solution is not necessarily the proper gauge of good generalization in simplified scenarios, and that models found by adaptive methods can outperform those found by plain gradient methods.
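
As a toy illustration (not the paper's experiments) of why a small norm alone need not indicate good generalization, the sketch below sets up a noiseless over-parameterized linear regression with a sparse ground truth: the minimum-ℓ2-norm interpolant and the larger-norm true sparse vector both fit the training data exactly, yet only the latter generalizes. All dimensions and the data-generating model are illustrative assumptions.

# Minimum-norm vs. another interpolating solution in an over-parameterized linear model.
import numpy as np

rng = np.random.default_rng(0)
n, d = 20, 50                          # fewer samples than parameters
w_true = np.zeros(d)
w_true[0] = 1.0                        # sparse ground truth

X = rng.standard_normal((n, d))
y = X @ w_true                         # noiseless labels, so w_true interpolates exactly

# Minimum-l2-norm interpolant (lstsq returns it for underdetermined systems).
w_mn = np.linalg.lstsq(X, y, rcond=None)[0]

X_test = rng.standard_normal((1000, d))
for name, w in [("min-norm", w_mn), ("sparse truth", w_true)]:
    train_mse = np.mean((X @ w - y) ** 2)
    test_mse = np.mean((X_test @ (w - w_true)) ** 2)
    print(f"{name:12s} norm={np.linalg.norm(w):.3f} "
          f"train MSE={train_mse:.2e} test MSE={test_mse:.3f}")

Here the solution with the smaller norm is the one that generalizes worse, which is the sense in which norm alone can be a misleading gauge.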

The Power of Interpolation: Understanding the Effectiveness of SGD in Modern Over-parametrized Learning

The key observation is that most modern learning architectures are over-parametrized and are trained to interpolate the data by driving the empirical loss close to zero, yet it remains unclear why these interpolating solutions perform well on test data.

Deep vs. shallow networks : An approximation theory perspective

A new definition of relative dimension is proposed to encapsulate different notions of sparsity of a function class that can possibly be exploited by deep networks but not by shallow ones to drastically reduce the complexity required for approximation and learning.

Stronger generalization bounds for deep nets via a compression approach

These results provide some theoretical justification for the widespread empirical success in compressing deep nets and yield generalization bounds that are orders of magnitude better in practice.