• Corpus ID: 6212000

Understanding deep learning requires rethinking generalization

  title={Understanding deep learning requires rethinking generalization},
  author={Chiyuan Zhang and Samy Bengio and Moritz Hardt and Benjamin Recht and Oriol Vinyals},
Despite their massive size, successful deep artificial neural networks can exhibit a remarkably small difference between training and test performance. [] Key ResultWe interpret our experimental findings by comparison with traditional models.

Figures and Tables from this paper

Understanding deep learning (still) requires rethinking generalization

These experiments establish that state-of-the-art convolutional networks for image classification trained with stochastic gradient methods easily fit a random labeling of the training data, and corroborate these experimental findings with a theoretical construction showing that simple depth two neural networks already have perfect finite sample expressivity.

Deep Nets Don't Learn via Memorization

It is established that there are qualitative differences when learning noise vs. natural datasets, and that for appropriately tuned explicit regularization, e.g. dropout, DNN training performance can be degraded on noise datasets without compromising generalization on real data.

Uniform convergence may be unable to explain generalization in deep learning

Through numerous experiments, doubt is cast on the power of uniform convergence-based generalization bounds to provide a complete picture of why overparameterized deep networks generalize well.

Towards Understanding the Generalization Bias of Two Layer Convolutional Linear Classifiers with Gradient Descent

A general analysis of the generalization performance as a function of data distribution and convolutional filter size is provided, given gradient descent as the optimization algorithm, and the results are interpreted using concrete examples.


  • Computer Science
  • 2020
This paper introduces a new, theoretically motivated measure of a network’s simplicity: the smallest fraction of the network‘s parameters that can be kept while pruning without adversely affecting its training loss.

What Do Neural Networks Learn When Trained With Random Labels?

It is shown analytically for convolutional and fully connected networks that an alignment between the principal components of network parameters and data takes place when training with random labels, and how this alignment produces a positive transfer.

What can linearized neural networks actually say about generalization?

It is shown that the linear approximations can indeed rank the learning complexity of certain tasks for neural networks, even when they achieve very different performances, and that networks overfit to these tasks mostly due to the evolution of their kernel during training, thus, revealing a new type of implicit bias.

Scaling description of generalization with number of parameters in deep learning

This work relies on the so-called Neural Tangent Kernel, which connects large neural nets to kernel methods, to show that the initialization causes finite-size random fluctuations of the neural net output function f N around its expectation, which affects the generalization error for classification.

Towards Understanding and Improving the Generalization Performance of Neural Networks

Input Output Convex Neural Networks (IOC-NNs) self regularize and significantly reduce the problem of overfitting and achieve similar performance as compared to base convolutional architectures and IOCNNs show robustness to noise in train labels.

Implicit Regularization of Stochastic Gradient Descent in Natural Language Processing: Observations and Implications

It is shown that pure SGD tends to converge to minimas that have better generalization performances in multiple natural language processing (NLP) tasks, and neural network's finite learning capability does not impact the intrinsic nature of SGD's implicit regularization effect.



Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift

Applied to a state-of-the-art image classification model, Batch Normalization achieves the same accuracy with 14 times fewer training steps, and beats the original model by a significant margin.

Dropout: a simple way to prevent neural networks from overfitting

It is shown that dropout improves the performance of neural networks on supervised learning tasks in vision, speech recognition, document classification and computational biology, obtaining state-of-the-art results on many benchmark data sets.

Learning Multiple Layers of Features from Tiny Images

It is shown how to train a multi-layer generative model that learns to extract meaningful features which resemble those found in the human visual cortex, using a novel parallelization algorithm to distribute the work among multiple machines connected on a network.

Train faster, generalize better: Stability of stochastic gradient descent

We show that parametric models trained by a stochastic gradient method (SGM) with few iterations have vanishing generalization error. We prove our results by arguing that SGM is algorithmically

Rethinking the Inception Architecture for Computer Vision

This work is exploring ways to scale up networks in ways that aim at utilizing the added computation as efficiently as possible by suitably factorized convolutions and aggressive regularization.

Deep Residual Learning for Image Recognition

This work presents a residual learning framework to ease the training of networks that are substantially deeper than those used previously, and provides comprehensive empirical evidence showing that these residual networks are easier to optimize, and can gain accuracy from considerably increased depth.

Convolutional Rectifier Networks as Generalized Tensor Decompositions

Developing effective methods for training convolutional arithmetic circuits may give rise to a deep learning architecture that is provably superior to Convolutional rectifier networks, which has so far been overlooked by practitioners.

ImageNet classification with deep convolutional neural networks

A large, deep convolutional neural network was trained to classify the 1.2 million high-resolution images in the ImageNet LSVRC-2010 contest into the 1000 different classes and employed a recently developed regularization method called "dropout" that proved to be very effective.

Learning Feature Representations with K-Means

This chapter will summarize recent results and technical tricks that are needed to make effective use of K-means clustering for learning large-scale representations of images and connect these results to other well-known algorithms to make clear when K-Means can be most useful.

The Loss Surfaces of Multilayer Networks

It is proved that recovering the global minimum becomes harder as the network size increases and that it is in practice irrelevant as global minimum often leads to overfitting.