# Uniform convergence may be unable to explain generalization in deep learning

    @inproceedings{Nagarajan2019UniformCM,
      title     = {Uniform convergence may be unable to explain generalization in deep learning},
      author    = {Vaishnavh Nagarajan and J. Zico Kolter},
      booktitle = {NeurIPS},
      year      = {2019}
    }

Aimed at explaining the surprisingly good generalization behavior of overparameterized deep networks, recent works have developed a variety of generalization bounds for deep learning, all based on the fundamental learning-theoretic technique of uniform convergence. While it is well-known that many of these existing bounds are numerically large, through numerous experiments we bring to light a more concerning aspect of these bounds: in practice, these bounds can *increase* with the training…
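The abstract's central observation can be illustrated with a toy calculation. A classic norm-based uniform-convergence bound on the generalization gap scales like $2\|w\|R/(\gamma\sqrt{n})$; if the learned weight norm grows faster than $\sqrt{n}$ with training set size (qualitatively the regime the paper probes), the bound grows even as the actual gap shrinks. The growth exponent 0.6 below is a made-up illustrative value, not a measurement from the paper:

```python
import math

def margin_bound(weight_norm, data_radius, margin, n):
    """Norm-based uniform-convergence bound (up to constants):
    gap <= 2 * ||w|| * R / (margin * sqrt(n))."""
    return 2.0 * weight_norm * data_radius / (margin * math.sqrt(n))

# Hypothetical scenario: the learned weight norm grows like n**0.6
# with training set size, while data radius and margin stay fixed.
# The bound then *increases* with n instead of vanishing.
for n in [1_000, 10_000, 100_000]:
    w_norm = n ** 0.6
    print(n, round(margin_bound(w_norm, data_radius=1.0, margin=1.0, n=n), 2))
```

Since $2n^{0.6}/\sqrt{n} = 2n^{0.1}$, the printed bound grows monotonically with $n$, mirroring the paper's empirical observation that such bounds can fail to decrease with more data.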

## 149 Citations

On the Generalization Mystery in Deep Learning

- Computer Science, arXiv
- 2022

The theory provides a causal explanation of how over-parameterized neural networks trained with gradient descent generalize well, and motivates a class of simple modifications to GD that attenuate memorization and improve generalization.

Generalization bounds for deep learning

- Computer Science, arXiv
- 2020

Desiderata for techniques that predict generalization errors for deep learning models in supervised learning are introduced, and a marginal-likelihood PAC-Bayesian bound is derived that fulfills desiderata 1-3 and 5.

Measuring Generalization with Optimal Transport

- Computer Science, NeurIPS
- 2021

Understanding the generalization of deep neural networks is one of the most important tasks in deep learning. Although much progress has been made, theoretical error bounds still often behave…

What can linearized neural networks actually say about generalization?

- Computer Science, NeurIPS
- 2021

It is shown that linear approximations can indeed rank the learning complexity of certain tasks for neural networks, even when the networks achieve very different performance, and that networks overfit to these tasks mostly due to the evolution of their kernel during training, revealing a new type of implicit bias.

Empirical Risk Minimization in the Interpolating Regime with Application to Neural Network Learning

- Computer Science
- 2019

This work shows that for certain large hypothesis classes, some interpolating ERMs enjoy very good statistical guarantees while others fail in the worst sense, and that the same phenomenon occurs for DNNs with zero training error and sufficiently large architectures.

Generalization of GANs and overparameterized models under Lipschitz continuity

- Computer Science
- 2021

Bounds are derived showing that penalizing the Lipschitz constant of the GAN loss can improve generalization, and it is shown that, when using Dropout or spectral normalization, both truly deep neural networks and GANs can generalize well without the curse of dimensionality.

On generalization bounds for deep networks based on loss surface implicit regularization

- Computer Science
- 2022

This work argues that under reasonable assumptions, the local geometry forces SGD to stay close to a low dimensional subspace and that this induces implicit regularization and results in tighter bounds on the generalization error for deep neural networks.

Good linear classifiers are abundant in the interpolating regime

- Computer Science, arXiv
- 2020

The results show that the usual style of analysis in statistical learning theory may not be fine-grained enough to capture the good generalization performance observed in practice, and that approaches based on the statistical mechanics of learning offer a promising alternative.

Uniform Convergence, Adversarial Spheres and a Simple Remedy

- Computer Science, ICML
- 2021

It is proved that the Neural Tangent Kernel (NTK) suffers from the same phenomenon, its origin is uncovered, and the important role of the output bias is highlighted, showing both theoretically and empirically how a sensible choice of bias completely mitigates the problem.

Understanding Generalization in Gradient Descent-Based Optimization

- Computer Science
- 2020

This work proposes an approach to explaining why neural networks trained with gradient descent generalize well on real datasets, even though they are capable of fitting random data, based on a hypothesis about the dynamics of gradient descent called Coherent Gradients.

## References

Showing 1–10 of 51 references

To understand deep learning we need to understand kernel learning

- Computer Science, ICML
- 2018

It is argued that progress on understanding deep learning will be difficult until more tractable "shallow" kernel methods are better understood, pointing to a need for new theoretical ideas for understanding the properties of classical kernel methods.

On Large-Batch Training for Deep Learning: Generalization Gap and Sharp Minima

- Computer Science, ICLR
- 2017

This work investigates the cause of this generalization drop in the large-batch regime and presents numerical evidence supporting the view that large-batch methods tend to converge to sharp minimizers of the training and testing functions; as is well known, sharp minima lead to poorer generalization.

Understanding deep learning requires rethinking generalization

- Computer Science, ICLR
- 2017

These experiments establish that state-of-the-art convolutional networks for image classification trained with stochastic gradient methods easily fit a random labeling of the training data, and confirm that simple depth two neural networks already have perfect finite sample expressivity.

Deterministic PAC-Bayesian generalization bounds for deep networks via generalizing noise-resilience

- Computer Science, ICLR
- 2019

A general PAC-Bayesian framework is presented that provides a generalization guarantee for the original (deterministic, uncompressed) network which does not scale with the product of the spectral norms of the weight matrices -- a guarantee that would not have been possible with prior approaches.

Non-vacuous Generalization Bounds at the ImageNet Scale: a PAC-Bayesian Compression Approach

- Computer Science, ICLR
- 2019

This paper provides the first non-vacuous generalization guarantees for realistic architectures applied to the ImageNet classification problem and establishes an absolute limit on expected compressibility as a function of expected generalization error.

SGD Learns Over-parameterized Networks that Provably Generalize on Linearly Separable Data

- Computer Science, ICLR
- 2018

This work proves convergence rates of SGD to a global minimum and provides generalization guarantees for this global minimum that are independent of the network size, and shows that SGD can avoid overfitting despite the high capacity of the model.

Stronger generalization bounds for deep nets via a compression approach

- Computer Science, ICML
- 2018

These results provide some theoretical justification for the widespread empirical success in compressing deep nets, and yield generalization bounds that are orders of magnitude better in practice.

Train longer, generalize better: closing the generalization gap in large batch training of neural networks

- Computer Science, NIPS
- 2017

This work proposes a "random walk on a random landscape" statistical model, which is known to exhibit similar "ultra-slow" diffusion behavior, and presents a novel algorithm named "Ghost Batch Normalization" that enables a significant decrease in the generalization gap without increasing the number of updates.

Computing Nonvacuous Generalization Bounds for Deep (Stochastic) Neural Networks with Many More Parameters than Training Data

- Computer Science, UAI
- 2017

By optimizing the PAC-Bayes bound directly, the approach of Langford and Caruana (2001) is extended to obtain nonvacuous generalization bounds for deep stochastic neural network classifiers with millions of parameters trained on only tens of thousands of examples.

Generalization in Deep Networks: The Role of Distance from Initialization

- Computer Science, arXiv
- 2019

Empirical evidence is provided demonstrating that the model capacity of SGD-trained deep networks is in fact restricted through implicit regularization of the distance from initialization, together with theoretical arguments that highlight the need for initialization-dependent notions of model capacity.