• Corpus ID: 61153527

Uniform convergence may be unable to explain generalization in deep learning

@inproceedings{Nagarajan2019UniformCM,
title={Uniform convergence may be unable to explain generalization in deep learning},
author={Vaishnavh Nagarajan and J. Zico Kolter},
booktitle={NeurIPS},
year={2019}
}
• Published in NeurIPS, 13 February 2019
• Computer Science
Aimed at explaining the surprisingly good generalization behavior of overparameterized deep networks, recent works have developed a variety of generalization bounds for deep learning, all based on the fundamental learning-theoretic technique of uniform convergence. While it is well-known that many of these existing bounds are numerically large, through numerous experiments, we bring to light a more concerning aspect of these bounds: in practice, these bounds can *increase* with the training…
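The abstract's central observation — that norm-based, uniform-convergence-style bound terms can grow with the training set — can be pictured with a toy sketch. This is not the paper's experimental setup; the model, data, and the tracked quantity (distance from initialization scaled by 1/√n, a term common to many such bounds) are invented for illustration only.

```python
# Toy sketch (not the paper's experiments): train an overparameterized
# linear model by gradient descent on growing training sets and track a
# simple norm-based quantity of the kind uniform-convergence bounds use,
# ||w - w0|| / sqrt(n). All data and hyperparameters here are invented.
import numpy as np

rng = np.random.default_rng(0)
d = 500  # parameter dimension far exceeds every n below (overparameterized)

def train_gd(n, steps=200, lr=0.1):
    X = rng.standard_normal((n, d)) / np.sqrt(d)   # synthetic inputs
    y = np.sign(X @ rng.standard_normal(d))        # labels from a random teacher
    w0 = np.zeros(d)
    w = w0.copy()
    for _ in range(steps):
        grad = X.T @ (X @ w - y) / n               # squared-loss gradient
        w -= lr * grad
    # The bound-like term: distance from initialization, scaled by 1/sqrt(n)
    return np.linalg.norm(w - w0) / np.sqrt(n)

for n in (50, 100, 200):
    print(n, round(float(train_gd(n)), 3))
```

Whether this term shrinks with n — as a useful bound must — depends on how fast the learned weights drift from initialization as more data is fit, which is exactly the tension the paper probes empirically.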
149 Citations


On the Generalization Mystery in Deep Learning
• Computer Science
ArXiv
• 2022
The theory provides a causal explanation of how over-parameterized neural networks trained with gradient descent generalize well, and motivates a class of simple modifications to GD that attenuate memorization and improve generalization.
Generalization bounds for deep learning
• Computer Science
ArXiv
• 2020
Desiderata for techniques that predict generalization errors for deep learning models in supervised learning are introduced, and a marginal-likelihood PAC-Bayesian bound is derived that fulfills desiderata 1-3 and 5.
Measuring Generalization with Optimal Transport
• Computer Science
NeurIPS
• 2021
Understanding the generalization of deep neural networks is one of the most important tasks in deep learning. Although much progress has been made, theoretical error bounds still often behave…
What can linearized neural networks actually say about generalization?
• Computer Science
NeurIPS
• 2021
It is shown that the linear approximations can indeed rank the learning complexity of certain tasks for neural networks, even when they achieve very different performances, and that networks overfit to these tasks mostly due to the evolution of their kernel during training, thus, revealing a new type of implicit bias.
Empirical Risk Minimization in the Interpolating Regime with Application to Neural Network Learning
• Computer Science
• 2019
This work shows that for certain, large hypotheses classes, some interpolating ERMs enjoy very good statistical guarantees while others fail in the worst sense, and shows that the same phenomenon occurs for DNNs with zero training error and sufficiently large architectures.
Generalization of GANs and overparameterized models under Lipschitz continuity
• Computer Science
• 2021
Bounds show that penalizing the Lipschitz constant of the GAN loss can improve generalization, and it is shown that, when using Dropout or spectral normalization, both truly deep neural networks and GANs can generalize well without the curse of dimensionality.
On generalization bounds for deep networks based on loss surface implicit regularization
• Computer Science
• 2022
This work argues that under reasonable assumptions, the local geometry forces SGD to stay close to a low dimensional subspace and that this induces implicit regularization and results in tighter bounds on the generalization error for deep neural networks.
Good linear classifiers are abundant in the interpolating regime
• Computer Science
ArXiv
• 2020
The results show that the usual style of analysis in statistical learning theory may not be fine-grained enough to capture the good generalization performance observed in practice, and that approaches based on the statistical mechanics of learning offer a promising alternative.
Uniform Convergence, Adversarial Spheres and a Simple Remedy
• Computer Science
ICML
• 2021
It is proved that the Neural Tangent Kernel (NTK) also suffers from the same phenomenon, its origin is uncovered, and the important role of the output bias is highlighted, showing both theoretically and empirically how a sensible choice of bias completely mitigates the problem.
Understanding Generalization in Gradient Descent-Based Optimization
This work proposes an approach to answering why neural networks trained with Gradient Descent generalize well on real datasets even though they are capable of fitting random data based on a hypothesis about the dynamics of gradient descent that is called Coherent Gradients.

References

Showing 1–10 of 51 references
To understand deep learning we need to understand kernel learning
• Computer Science
ICML
• 2018
It is argued that progress on understanding deep learning will be difficult until more tractable "shallow" kernel methods are better understood, and a need for new theoretical ideas for understanding properties of classical kernel methods.
On Large-Batch Training for Deep Learning: Generalization Gap and Sharp Minima
• Computer Science
ICLR
• 2017
This work investigates the cause for this generalization drop in the large-batch regime and presents numerical evidence that supports the view that large-batch methods tend to converge to sharp minimizers of the training and testing functions — and as is well known, sharp minima lead to poorer generalization.
Understanding deep learning requires rethinking generalization
• Computer Science
ICLR
• 2017
These experiments establish that state-of-the-art convolutional networks for image classification trained with stochastic gradient methods easily fit a random labeling of the training data, and confirm that simple depth two neural networks already have perfect finite sample expressivity.
Deterministic PAC-Bayesian generalization bounds for deep networks via generalizing noise-resilience
• Computer Science
ICLR
• 2019
A general PAC-Bayesian framework that provides a generalization guarantee for the original (deterministic, uncompressed) network, that does not scale with product of the spectral norms of the weight matrices -- a guarantee that would not have been possible with prior approaches.
Non-vacuous Generalization Bounds at the ImageNet Scale: a PAC-Bayesian Compression Approach
• Computer Science
ICLR
• 2019
This paper provides the first non-vacuous generalization guarantees for realistic architectures applied to the ImageNet classification problem and establishes an absolute limit on expected compressibility as a function of expected generalization error.
SGD Learns Over-parameterized Networks that Provably Generalize on Linearly Separable Data
• Computer Science
ICLR
• 2018
This work proves convergence rates of SGD to a global minimum and provides generalization guarantees for this global minimum that are independent of the network size, and shows that SGD can avoid overfitting despite the high capacity of the model.
Stronger generalization bounds for deep nets via a compression approach
• Computer Science
ICML
• 2018
These results provide some theoretical justification for widespread empirical success in compressing deep nets and show generalization bounds that are orders of magnitude better in practice.
Train longer, generalize better: closing the generalization gap in large batch training of neural networks
• Computer Science
NIPS
• 2017
This work proposes a "random walk on random landscape" statistical model which is known to exhibit similar "ultra-slow" diffusion behavior and presents a novel algorithm named "Ghost Batch Normalization" which enables significant decrease in the generalization gap without increasing the number of updates.
Computing Nonvacuous Generalization Bounds for Deep (Stochastic) Neural Networks with Many More Parameters than Training Data
• Computer Science
UAI
• 2017
By optimizing the PAC-Bayes bound directly, the approach of Langford and Caruana (2001) is extended to obtain nonvacuous generalization bounds for deep stochastic neural network classifiers with millions of parameters trained on only tens of thousands of examples.
Generalization in Deep Networks: The Role of Distance from Initialization
• Computer Science
ArXiv
• 2019
Empirical evidence is provided that the model capacity of SGD-trained deep networks is in fact restricted through implicit regularization of the distance from initialization, alongside theoretical arguments that further highlight the need for initialization-dependent notions of model capacity.