# Understanding deep learning requires rethinking generalization

```bibtex
@article{Zhang2017UnderstandingDL,
  title   = {Understanding deep learning requires rethinking generalization},
  author  = {Chiyuan Zhang and Samy Bengio and Moritz Hardt and Benjamin Recht and Oriol Vinyals},
  journal = {ArXiv},
  year    = {2017},
  volume  = {abs/1611.03530}
}
```

Despite their massive size, successful deep artificial neural networks can
exhibit a remarkably small difference between training and test performance. […] We interpret our experimental findings by comparison with traditional models.

## 3,666 Citations

### Understanding deep learning (still) requires rethinking generalization

- Computer Science
- Commun. ACM
- 2021

These experiments establish that state-of-the-art convolutional networks for image classification, trained with stochastic gradient methods, easily fit a random labeling of the training data; the experimental findings are corroborated by a theoretical construction showing that simple depth-two neural networks already have perfect finite-sample expressivity.
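The randomization test and the depth-two expressivity claim can be illustrated together in a minimal sketch (not the paper's actual experiment): a two-layer ReLU network with enough hidden units can interpolate purely random labels, with the readout solved in closed form. All names and sizes here are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy randomization test: n points get purely random +/-1 labels, and a
# depth-two ReLU network fits them. The first layer is random and fixed;
# only the linear readout is solved, by least squares.
n, d, width = 50, 10, 200          # width >= n gives enough capacity
X = rng.normal(size=(n, d))
y = rng.integers(0, 2, size=n) * 2.0 - 1.0   # random +/-1 labels

W = rng.normal(size=(d, width))    # fixed random first layer
H = np.maximum(X @ W, 0.0)         # ReLU features, shape (n, width)
a, *_ = np.linalg.lstsq(H, y, rcond=None)    # readout weights

train_acc = np.mean(np.sign(H @ a) == y)
print(train_acc)                   # the network memorizes the random labels
```

Because the ReLU feature matrix generically has full row rank once the width exceeds the sample count, the least-squares readout interpolates the labels exactly, mirroring the finite-sample expressivity construction.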

### Deep Nets Don't Learn via Memorization

- Computer Science
- ICLR
- 2017

It is established that there are qualitative differences when learning from noise vs. natural datasets, and that with appropriately tuned explicit regularization (e.g., dropout), DNN training performance can be degraded on noise datasets without compromising generalization on real data.

### Uniform convergence may be unable to explain generalization in deep learning

- Computer Science
- NeurIPS
- 2019

Through numerous experiments, doubt is cast on the power of uniform convergence-based generalization bounds to provide a complete picture of why overparameterized deep networks generalize well.

### Towards Understanding the Generalization Bias of Two Layer Convolutional Linear Classifiers with Gradient Descent

- Computer Science
- AISTATS
- 2019

A general analysis of the generalization performance as a function of data distribution and convolutional filter size is provided, given gradient descent as the optimization algorithm, and the results are interpreted using concrete examples.

### Robustness to Pruning Predicts Generalization in Deep Neural Networks

- Computer Science
- 2020

This paper introduces a new, theoretically motivated measure of a network's simplicity: the smallest fraction of the network's parameters that can be kept while pruning without adversely affecting its training loss.
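The measure described above can be sketched on a toy linear model (a hypothetical simplification, not the paper's procedure): prune weights by magnitude and report the smallest fraction that must be kept for the training loss to stay within a tolerance of its unpruned value. The function names and the tolerance are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy "prunability" measure on a linear regression model where only a
# few weights actually matter. Pruning removes the smallest-magnitude
# weights first.
X = rng.normal(size=(100, 20))
w = np.zeros(20)
w[:5] = [3.0, -2.0, 1.5, 2.5, -1.0]   # only 5 of 20 weights matter
y = X @ w

def loss(weights):
    return np.mean((X @ weights - y) ** 2)

def kept_fraction(weights, tol=1e-6):
    order = np.argsort(np.abs(weights))           # smallest magnitude first
    for n_pruned in range(len(weights), -1, -1):  # prune as many as possible
        pruned = weights.copy()
        pruned[order[:n_pruned]] = 0.0
        if loss(pruned) <= loss(weights) + tol:
            return (len(weights) - n_pruned) / len(weights)
    return 1.0

print(kept_fraction(w))   # 0.25: only 5 of the 20 weights are needed
```

A simpler (more prunable) model keeps a smaller fraction; the paper's claim is that this fraction tracks generalization in deep networks.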

### What Do Neural Networks Learn When Trained With Random Labels?

- Computer Science
- NeurIPS
- 2020

It is shown analytically for convolutional and fully connected networks that an alignment between the principal components of network parameters and data takes place when training with random labels, and how this alignment produces a positive transfer.

### What can linearized neural networks actually say about generalization?

- Computer Science
- NeurIPS
- 2021

It is shown that linear approximations can indeed rank the learning complexity of certain tasks for neural networks, even when the networks achieve very different performances, and that networks overfit these tasks mostly due to the evolution of their kernel during training, revealing a new type of implicit bias.

### Scaling description of generalization with number of parameters in deep learning

- Computer Science
- Journal of Statistical Mechanics: Theory and Experiment
- 2020

This work relies on the so-called Neural Tangent Kernel, which connects large neural nets to kernel methods, to show that the initialization causes finite-size random fluctuations of the neural net output function f_N around its expectation, which affects the generalization error for classification.

### Towards Understanding and Improving the Generalization Performance of Neural Networks

- Computer Science
- 2022

Input Output Convex Neural Networks (IOC-NNs) self-regularize, significantly reducing the problem of overfitting; they achieve performance comparable to their base convolutional architectures and show robustness to noise in training labels.

### Implicit Regularization of Stochastic Gradient Descent in Natural Language Processing: Observations and Implications

- Computer Science
- ArXiv
- 2018

It is shown that pure SGD tends to converge to minima with better generalization performance on multiple natural language processing (NLP) tasks, and that a neural network's finite learning capability does not affect the intrinsic nature of SGD's implicit regularization effect.

## References

Showing 1–10 of 36 references

### Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift

- Computer Science
- ICML
- 2015

Applied to a state-of-the-art image classification model, Batch Normalization achieves the same accuracy with 14 times fewer training steps, and beats the original model by a significant margin.
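The core transform behind the result above is easy to state in code. This is a minimal sketch of the batch-normalization forward pass at training time (feature-wise normalization plus a learnable scale and shift); the function name and shapes are illustrative.

```python
import numpy as np

# Minimal sketch of the batch-normalization transform: normalize each
# feature over the mini-batch, then apply a learnable scale (gamma)
# and shift (beta).
def batch_norm(x, gamma, beta, eps=1e-5):
    mean = x.mean(axis=0)                 # per-feature batch mean
    var = x.var(axis=0)                   # per-feature batch variance
    x_hat = (x - mean) / np.sqrt(var + eps)
    return gamma * x_hat + beta

rng = np.random.default_rng(0)
x = rng.normal(loc=5.0, scale=3.0, size=(64, 8))
out = batch_norm(x, gamma=np.ones(8), beta=np.zeros(8))
print(out.mean(), out.std())              # roughly 0 and 1
```

At inference time the batch statistics are replaced by running averages collected during training, which this sketch omits.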

### Dropout: a simple way to prevent neural networks from overfitting

- Computer Science
- J. Mach. Learn. Res.
- 2014

It is shown that dropout improves the performance of neural networks on supervised learning tasks in vision, speech recognition, document classification and computational biology, obtaining state-of-the-art results on many benchmark data sets.
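The mechanism is simple enough to sketch. Below is the common "inverted" dropout variant (a standard reformulation, not necessarily the exact scaling used in the paper): each unit is dropped with probability p during training, survivors are scaled by 1/(1-p) so the expected activation is unchanged, and the layer is the identity at test time.

```python
import numpy as np

rng = np.random.default_rng(0)

# Inverted dropout: drop each unit with probability p at training time
# and rescale the survivors so the expected activation is preserved.
def dropout(x, p=0.5, training=True):
    if not training:
        return x
    mask = rng.random(x.shape) >= p       # keep with probability 1 - p
    return x * mask / (1.0 - p)

x = np.ones((1000, 100))
y = dropout(x, p=0.5)
print(y.mean())                           # close to 1.0 in expectation
```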

### Learning Multiple Layers of Features from Tiny Images

- Computer Science
- 2009

It is shown how to train a multi-layer generative model that learns to extract meaningful features which resemble those found in the human visual cortex, using a novel parallelization algorithm to distribute the work among multiple machines connected on a network.

### Train faster, generalize better: Stability of stochastic gradient descent

- Computer Science
- ICML
- 2016

We show that parametric models trained by a stochastic gradient method (SGM) with few iterations have vanishing generalization error. We prove our results by arguing that SGM is algorithmically…
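The stability notion behind this argument can be sketched empirically (a toy illustration under simplifying assumptions, not the paper's proof): run SGD with the same initialization and sample order on two datasets that differ in a single example, and measure how far the two parameter vectors drift. For a smooth loss, a small step size, and few passes, the drift stays small.

```python
import numpy as np

rng = np.random.default_rng(0)

# SGD on squared loss for a linear model; fixed sample order so the two
# runs differ only through the one perturbed example.
def sgd(X, y, epochs=2, lr=0.01):
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        for i in range(len(X)):
            grad = (X[i] @ w - y[i]) * X[i]   # gradient of 0.5*(x.w - y)^2
            w -= lr * grad
    return w

X = rng.normal(size=(200, 5))
y = rng.normal(size=200)
X2, y2 = X.copy(), y.copy()
X2[17], y2[17] = rng.normal(size=5), rng.normal()   # perturb one example

w1, w2 = sgd(X, y), sgd(X2, y2)
print(np.linalg.norm(w1 - w2))            # small: SGD is stable here
```

Small parameter drift under a one-example swap is exactly the uniform-stability property that the paper converts into a generalization bound.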

### Rethinking the Inception Architecture for Computer Vision

- Computer Science
- 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)
- 2016

This work explores ways to scale up networks that aim to utilize the added computation as efficiently as possible, through suitably factorized convolutions and aggressive regularization.

### Deep Residual Learning for Image Recognition

- Computer Science
- 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)
- 2016

This work presents a residual learning framework to ease the training of networks that are substantially deeper than those used previously, and provides comprehensive empirical evidence showing that these residual networks are easier to optimize, and can gain accuracy from considerably increased depth.

### Convolutional Rectifier Networks as Generalized Tensor Decompositions

- Computer Science
- ICML
- 2016

Developing effective methods for training convolutional arithmetic circuits may give rise to a deep learning architecture that is provably superior to convolutional rectifier networks but has so far been overlooked by practitioners.

### ImageNet classification with deep convolutional neural networks

- Computer Science
- Commun. ACM
- 2012

A large, deep convolutional neural network was trained to classify the 1.2 million high-resolution images in the ImageNet LSVRC-2010 contest into the 1000 different classes and employed a recently developed regularization method called "dropout" that proved to be very effective.

### Learning Feature Representations with K-Means

- Computer Science
- Neural Networks: Tricks of the Trade
- 2012

This chapter will summarize recent results and technical tricks that are needed to make effective use of K-means clustering for learning large-scale representations of images and connect these results to other well-known algorithms to make clear when K-Means can be most useful.
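One of the chapter's central tricks can be sketched compactly: learn centroids with K-means on (already whitened) patches, then encode inputs with the sparse "triangle" activation f_j(x) = max(0, mean_j dist(x) − dist(x, c_j)). The sketch below is a hypothetical minimal version; the function names, sizes, and iteration count are illustrative, and whitening is assumed to have happened upstream.

```python
import numpy as np

rng = np.random.default_rng(0)

# Plain Lloyd's K-means, initialized from random data points.
def kmeans(X, k, iters=20):
    centroids = X[rng.choice(len(X), k, replace=False)]
    for _ in range(iters):
        d = np.linalg.norm(X[:, None] - centroids[None], axis=2)
        assign = d.argmin(axis=1)
        for j in range(k):
            if np.any(assign == j):
                centroids[j] = X[assign == j].mean(axis=0)
    return centroids

# "Triangle" encoding: activation is how much closer a centroid is than
# the average centroid; distant centroids produce exact zeros (sparsity).
def triangle_features(X, centroids):
    d = np.linalg.norm(X[:, None] - centroids[None], axis=2)
    return np.maximum(0.0, d.mean(axis=1, keepdims=True) - d)

X = rng.normal(size=(200, 16))            # stand-in for whitened patches
C = kmeans(X, k=8)
F = triangle_features(X, C)
print(F.shape)                            # one sparse nonnegative code per input
```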

### The Loss Surfaces of Multilayer Networks

- Computer Science
- AISTATS
- 2015

It is proved that recovering the global minimum becomes harder as the network size increases, and that this is irrelevant in practice, since the global minimum often leads to overfitting.