Corpus ID: 61153527

Uniform convergence may be unable to explain generalization in deep learning

@inproceedings{Nagarajan2019UniformCM,
  title={Uniform convergence may be unable to explain generalization in deep learning},
  author={Vaishnavh Nagarajan and J. Zico Kolter},
  booktitle={NeurIPS},
  year={2019}
}
Aimed at explaining the surprisingly good generalization behavior of overparameterized deep networks, recent works have developed a variety of generalization bounds for deep learning, all based on the fundamental learning-theoretic technique of uniform convergence. While it is well-known that many of these existing bounds are numerically large, through numerous experiments, we bring to light a more concerning aspect of these bounds: in practice, these bounds can increase with the training…
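For orientation, a standard uniform-convergence bound of the kind the paper critiques (a generic statement, not reproduced from the paper) says that, for a loss taking values in $[0,1]$ and with probability at least $1-\delta$ over an i.i.d. sample $S$ of size $m$,

\[ L_{\mathcal{D}}(h) \;\le\; L_S(h) \;+\; 2\,\mathfrak{R}_m(\mathcal{H}) \;+\; \sqrt{\frac{\ln(1/\delta)}{2m}} \qquad \text{for all } h \in \mathcal{H}, \]

where $L_{\mathcal{D}}$ is the population risk, $L_S$ the empirical risk, and $\mathfrak{R}_m(\mathcal{H})$ the Rademacher complexity of the hypothesis class $\mathcal{H}$. The paper's observation is that bounds built from this kind of two-sided, class-wide control can grow with the training set size for overparameterized networks, rather than explaining why their test error is small.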
Max-Margin Works while Large Margin Fails: Generalization without Uniform Convergence
TLDR
This analysis provides insight on why memorization can coexist with generalization: in this challenging regime where generalization occurs but UC fails, near-max-margin classifiers simultaneously contain some generalizable components and some overfitting components that memorize the data.
On the Generalization Mystery in Deep Learning
TLDR
The theory provides a causal explanation of how over-parameterized neural networks trained with gradient descent generalize well, and motivates a class of simple modifications to GD that attenuate memorization and improve generalization.
Generalization bounds for deep learning
TLDR
Desiderata for techniques that predict generalization errors for deep learning models in supervised learning are introduced, and a marginal-likelihood PAC-Bayesian bound is derived that fulfills desiderata 1-3 and 5.
Generalization Through The Lens Of Leave-One-Out Error
TLDR
It is demonstrated that the leave-one-out error provides a tractable way to estimate the generalization ability of deep neural networks in the kernel regime, opening the door to potential new research directions in the field of generalization.
Measuring Generalization with Optimal Transport
Understanding the generalization of deep neural networks is one of the most important tasks in deep learning. Although much progress has been made, theoretical error bounds still often behave…
In Search of Robust Measures of Generalization
TLDR
This work addresses the question of how to evaluate generalization bounds empirically and argues that generalization measures should instead be evaluated within the framework of distributional robustness.
What can linearized neural networks actually say about generalization?
TLDR
It is shown that the linear approximations can indeed rank the learning complexity of certain tasks for neural networks, even when they achieve very different performances, and that networks overfit to these tasks mostly due to the evolution of their kernel during training, thus revealing a new type of implicit bias.
Empirical Risk Minimization in the Interpolating Regime with Application to Neural Network Learning
TLDR
This work shows that for certain, large hypotheses classes, some interpolating ERMs enjoy very good statistical guarantees while others fail in the worst sense, and shows that the same phenomenon occurs for DNNs with zero training error and sufficiently large architectures.
Generalization of GANs and overparameterized models under Lipschitz continuity
TLDR
The derived bounds show that penalizing the Lipschitz constant of the GAN loss can improve generalization, and that, when using Dropout or spectral normalization, both truly deep neural networks and GANs can generalize well without the curse of dimensionality.
On generalization bounds for deep networks based on loss surface implicit regularization
TLDR
This work argues that under reasonable assumptions, the local geometry forces SGD to stay close to a low dimensional subspace and that this induces implicit regularization and results in tighter bounds on the generalization error for deep neural networks.
...

References

SHOWING 1-10 OF 51 REFERENCES
To understand deep learning we need to understand kernel learning
TLDR
It is argued that progress on understanding deep learning will be difficult until more tractable "shallow" kernel methods are better understood, highlighting a need for new theoretical ideas for understanding properties of classical kernel methods.
On Large-Batch Training for Deep Learning: Generalization Gap and Sharp Minima
TLDR
This work investigates the cause of this generalization drop in the large-batch regime and presents numerical evidence supporting the view that large-batch methods tend to converge to sharp minimizers of the training and testing functions, and, as is well known, sharp minima lead to poorer generalization.
Understanding deep learning requires rethinking generalization
TLDR
These experiments establish that state-of-the-art convolutional networks for image classification trained with stochastic gradient methods easily fit a random labeling of the training data, and confirm that simple depth two neural networks already have perfect finite sample expressivity.
Deterministic PAC-Bayesian generalization bounds for deep networks via generalizing noise-resilience
TLDR
A general PAC-Bayesian framework that provides a generalization guarantee for the original (deterministic, uncompressed) network, one that does not scale with the product of the spectral norms of the weight matrices -- a guarantee that would not have been possible with prior approaches.
Non-vacuous Generalization Bounds at the ImageNet Scale: a PAC-Bayesian Compression Approach
TLDR
This paper provides the first non-vacuous generalization guarantees for realistic architectures applied to the ImageNet classification problem and establishes an absolute limit on expected compressibility as a function of expected generalization error.
SGD Learns Over-parameterized Networks that Provably Generalize on Linearly Separable Data
TLDR
This work proves convergence rates of SGD to a global minimum and provides generalization guarantees for this global minimum that are independent of the network size, and shows that SGD can avoid overfitting despite the high capacity of the model.
Stronger generalization bounds for deep nets via a compression approach
TLDR
These results provide some theoretical justification for widespread empirical success in compressing deep nets and show generalization bounds that are orders of magnitude better in practice.
Train longer, generalize better: closing the generalization gap in large batch training of neural networks
TLDR
This work proposes a "random walk on random landscape" statistical model which is known to exhibit similar "ultra-slow" diffusion behavior and presents a novel algorithm named "Ghost Batch Normalization" which enables significant decrease in the generalization gap without increasing the number of updates.
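The ghost-batch idea itself is simple to sketch: during large-batch training, batch-normalization statistics are computed over smaller virtual ("ghost") sub-batches rather than over the full batch, so the normalization noise resembles that of small-batch training. Below is a minimal PyTorch sketch under that description; the class name, the default ghost-batch size, and the restriction to the 1-D case are illustrative assumptions, not the authors' implementation.

import torch
import torch.nn as nn

class GhostBatchNorm(nn.Module):
    """BatchNorm whose statistics are computed over small 'ghost' sub-batches.
    Illustrative sketch of the ghost-batch idea, not the paper's code."""

    def __init__(self, num_features: int, ghost_batch_size: int = 32, **bn_kwargs):
        super().__init__()
        self.ghost_batch_size = ghost_batch_size
        self.bn = nn.BatchNorm1d(num_features, **bn_kwargs)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        if not self.training:
            # At evaluation time, use the accumulated running statistics as usual.
            return self.bn(x)
        # Normalize each ghost batch independently (assumes the batch size is a
        # multiple of the ghost batch size), mimicking small-batch statistics.
        chunks = x.split(self.ghost_batch_size, dim=0)
        return torch.cat([self.bn(chunk) for chunk in chunks], dim=0)

In a large-batch training setup this layer would simply take the place of nn.BatchNorm1d; the running statistics are still shared, only the per-step normalization is computed per ghost batch.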
Computing Nonvacuous Generalization Bounds for Deep (Stochastic) Neural Networks with Many More Parameters than Training Data
TLDR
By optimizing the PAC-Bayes bound directly, this work extends the approach of Langford and Caruana (2001) and obtains nonvacuous generalization bounds for deep stochastic neural network classifiers with millions of parameters trained on only tens of thousands of examples.
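For context on the quantity being optimized, one common statement of the PAC-Bayes bound (McAllester-style, up to the exact form of the logarithmic term; not quoted from the paper) is: for any prior $P$ fixed before seeing the data and any posterior $Q$ over classifiers, with probability at least $1-\delta$ over a sample $S$ of size $m$,

\[ \mathbb{E}_{h \sim Q}\!\left[L_{\mathcal{D}}(h)\right] \;\le\; \mathbb{E}_{h \sim Q}\!\left[L_S(h)\right] \;+\; \sqrt{\frac{\mathrm{KL}(Q \,\|\, P) + \ln\frac{2\sqrt{m}}{\delta}}{2m}}. \]

Minimizing the right-hand side over $Q$ (for example, over the mean and variance of Gaussian noise added to the trained weights) is what yields the nonvacuous numerical bounds described in this entry.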
Generalization in Deep Networks: The Role of Distance from Initialization
TLDR
Empirical evidence is provided that the model capacity of SGD-trained deep networks is in fact restricted through implicit regularization of the distance from initialization, along with theoretical arguments that further highlight the need for initialization-dependent notions of model capacity.
...