Understanding deep learning (still) requires rethinking generalization

@article{Zhang2021UnderstandingDL,
  title={Understanding deep learning (still) requires rethinking generalization},
  author={Chiyuan Zhang and Samy Bengio and Moritz Hardt and Benjamin Recht and Oriol Vinyals},
  journal={Communications of the ACM},
  year={2021},
  volume={64},
  pages={107--115}
}
Despite their massive size, successful deep artificial neural networks can exhibit a remarkably small gap between training and test performance. Conventional wisdom attributes small generalization error either to properties of the model family or to the regularization techniques used during training. Through extensive systematic experiments, we show how these traditional approaches fail to explain why large neural networks generalize well in practice. Specifically, our experiments establish… 
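
The randomization test at the heart of these experiments is straightforward to reproduce. Below is a minimal sketch in PyTorch (not the authors' original code): a small convolutional network is trained by SGD on inputs paired with uniformly random labels, and, consistent with the paper's observations, training accuracy can still be driven toward 100% even though the labels carry no information. The architecture, the hyperparameters, and the use of Gaussian-noise inputs (one of the paper's input-corruption variants; the main experiments use CIFAR-10 and ImageNet with standard architectures) are illustrative assumptions.

# Minimal sketch of the randomization test: fit a small CNN to random labels.
# Architecture, optimizer settings, and Gaussian-noise inputs are illustrative
# assumptions, not the paper's exact experimental setup.
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

torch.manual_seed(0)

# Synthetic stand-in data: Gaussian-noise "images" with uniformly random labels.
n, num_classes = 2048, 10
images = torch.randn(n, 3, 32, 32)
labels = torch.randint(0, num_classes, (n,))
loader = DataLoader(TensorDataset(images, labels), batch_size=128, shuffle=True)

# A small over-parameterized CNN (far more parameters than training points).
model = nn.Sequential(
    nn.Conv2d(3, 64, 3, padding=1), nn.ReLU(),
    nn.Conv2d(64, 64, 3, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(4), nn.Flatten(),
    nn.Linear(64 * 16, 512), nn.ReLU(),
    nn.Linear(512, num_classes),
)
opt = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
loss_fn = nn.CrossEntropyLoss()

for epoch in range(100):
    correct = 0
    for x, y in loader:
        opt.zero_grad()
        out = model(x)
        loss = loss_fn(out, y)
        loss.backward()
        opt.step()
        correct += (out.argmax(1) == y).sum().item()
    print(f"epoch {epoch}: train accuracy {correct / n:.3f}")

# With random labels there is no signal to generalize from, so test accuracy
# stays at chance level; the point is that low training error alone cannot
# distinguish memorization from learning.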

Citations

Nonparametric Regression with Shallow Overparameterized Neural Networks Trained by GD with Early Stopping
TLDR
This work explores the ability of overparameterized shallow neural networks to learn Lipschitz regression functions, with and without label noise, when trained by gradient descent, and proposes an early stopping rule under which optimal rates are achieved.
The Equilibrium Hypothesis: Rethinking implicit regularization in Deep Neural Networks
TLDR
The Equilibrium Hypothesis, which states that the layers achieving some balance between forward and backward information loss are the ones with the highest alignment to the data labels, is introduced and empirically validated.
On the Convergence of Shallow Neural Network Training with Randomly Masked Neurons
TLDR
This work proves a linear convergence rate of the training error (within an error region) for an overparameterized single-hidden-layer perceptron with ReLU activations on a regression task, and shows that, for a fixed neuron selection probability, the error term decreases as the number of surrogate models increases and grows with the number of local training steps.
Unveiling the structure of wide flat minima in neural networks
TLDR
It is shown that wide flat minima arise as complex extensive structures from the coalescence of minima around "high-margin" (i.e., locally robust) configurations, which, despite being exponentially rare compared to zero-margin ones, tend to concentrate in particular regions.
Depth Without the Magic: Inductive Bias of Natural Gradient Descent
TLDR
It is demonstrated that there exist learning problems where natural gradient descent fails to generalize, while gradient descent with the right architecture performs well.
Neurashed: A Phenomenological Model for Imitating Deep Learning Training
TLDR
It is argued that a future deep learning theory should inherit three characteristics: a hierarchically structured network architecture, parameters iteratively optimized using stochastic gradient-based methods, and information from the data that evolves compressively.
Statistical Mechanics of Deep Linear Neural Networks: The Backpropagating Kernel Renormalization
TLDR
This work is the first exact statistical mechanical study of learning in a family of deep neural networks, and the first successful theory of learning through the successive integration of degrees of freedom in the learned weight space.
Compression Implies Generalization
TLDR
A compression-based framework is established that is simple and powerful enough to extend the generalization bounds of Arora et al. so that they also hold for the original (uncompressed) network, and that yields simple proofs of the strongest known generalization bounds for other popular machine learning models, namely support vector machines and boosting.
A Farewell to the Bias-Variance Tradeoff? An Overview of the Theory of Overparameterized Machine Learning
TLDR
This paper provides a succinct overview of this emerging theory of overparameterized ML (henceforth abbreviated as TOPML) that explains these recent findings through a statistical signal processing perspective and emphasizes the unique aspects that define the TOPML research area as a subfield of modern ML theory.
The Separation Capacity of Random Neural Networks
TLDR
It is shown under what conditions a random neural network can make two classes X−, X+ linearly separable: a sufficiently large two-layer ReLU network with standard Gaussian weights and uniformly distributed biases solves this data separation problem.

References

Showing 1-10 of 67 references
Understanding deep learning requires rethinking generalization
TLDR
These experiments establish that state-of-the-art convolutional networks for image classification trained with stochastic gradient methods easily fit a random labeling of the training data, and confirm that simple depth two neural networks already have perfect finite sample expressivity.
A Bayesian Perspective on Generalization and Stochastic Gradient Descent
TLDR
It is proposed that the noise introduced by small mini-batches drives the parameters towards minima whose evidence is large, and it is demonstrated that, when one holds the learning rate fixed, there is an optimum batch size which maximizes the test set accuracy.
A Closer Look at Memorization in Deep Networks
TLDR
The analysis suggests that notions of effective capacity that are dataset-independent are unlikely to explain the generalization performance of deep networks trained with gradient-based methods, because the training data itself plays an important role in determining the degree of memorization.
Non-vacuous Generalization Bounds at the ImageNet Scale: a PAC-Bayesian Compression Approach
TLDR
This paper provides the first non-vacuous generalization guarantees for realistic architectures applied to the ImageNet classification problem and establishes an absolute limit on expected compressibility as a function of expected generalization error.
Dropout: a simple way to prevent neural networks from overfitting
TLDR
It is shown that dropout improves the performance of neural networks on supervised learning tasks in vision, speech recognition, document classification and computational biology, obtaining state-of-the-art results on many benchmark data sets.
Computing Nonvacuous Generalization Bounds for Deep (Stochastic) Neural Networks with Many More Parameters than Training Data
TLDR
By optimizing the PAC-Bayes bound directly, the approach of Langford and Caruana (2001) is extended to obtain nonvacuous generalization bounds for deep stochastic neural network classifiers with millions of parameters trained on only tens of thousands of examples (one standard form of the bound being optimized is sketched after this reference list).
Minimum norm solutions do not always generalize well for over-parameterized problems
TLDR
It is empirically shown that the minimum norm solution is not necessarily a proper gauge of good generalization in simplified scenarios, and that different models found by adaptive methods can outperform plain gradient methods.
The Power of Interpolation: Understanding the Effectiveness of SGD in Modern Over-parametrized Learning
TLDR
The key observation is that most modern learning architectures are over-parametrized and are trained to interpolate the data by driving the empirical loss close to zero; yet it is still unclear why these interpolated solutions perform well on test data.
Deep vs. shallow networks : An approximation theory perspective
TLDR
A new definition of relative dimension is proposed to encapsulate different notions of sparsity of a function class that can possibly be exploited by deep networks but not by shallow ones to drastically reduce the complexity required for approximation and learning.
Stronger generalization bounds for deep nets via a compression approach
TLDR
These results provide some theoretical justification for the widespread empirical success in compressing deep nets and yield generalization bounds that are orders of magnitude better in practice.
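
Two of the references above ("Computing Nonvacuous Generalization Bounds for Deep (Stochastic) Neural Networks with Many More Parameters than Training Data" and "Non-vacuous Generalization Bounds at the ImageNet Scale") proceed by numerically optimizing or evaluating a PAC-Bayes bound over a distribution of network weights. As a hedged reminder of the general shape of such bounds, and not the exact inequality used in either paper (whose constants and variants differ), one standard form, the Langford-Seeger/Maurer inequality, reads:

% With probability at least 1 - \delta over an i.i.d. sample of size m,
% simultaneously for every posterior Q over hypotheses, for a fixed prior P:
\[
  \mathrm{kl}\!\left( \hat{e}_Q \,\middle\|\, e_Q \right)
  \;\le\;
  \frac{\mathrm{KL}(Q \,\|\, P) + \ln\frac{2\sqrt{m}}{\delta}}{m},
\]
% where \hat{e}_Q and e_Q are the empirical and true error rates of the Gibbs
% classifier drawn from Q, and kl(a \| b) is the binary KL divergence.

Roughly speaking, choosing Q as a distribution over weights centered at the trained network and numerically inverting the binary KL term is what makes such bounds non-vacuous for over-parameterized models.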