Redundant representations help generalization in wide neural networks

Diego Doimo, Aldo Glielmo, Sebastian Goldt, Alessandro Laio · Corpus ID: 249394712

Deep neural networks (DNNs) defy the classical bias-variance trade-off: adding parameters to a DNN that interpolates its training data will typically improve its generalization performance. Explaining the mechanism behind this “benign overfitting” in deep networks remains an outstanding challenge. Here, we study the last hidden layer representations of various state-of-the-art convolutional neural networks and find that if the last hidden representation is wide enough, its neurons tend to split into groups that carry identical information and differ from each other only by statistically independent noise.

Understanding deep learning requires rethinking generalization

These experiments establish that state-of-the-art convolutional networks for image classification trained with stochastic gradient methods easily fit a random labeling of the training data, and confirm that simple depth two neural networks already have perfect finite sample expressivity.

Intrinsic dimension of data representations in deep neural networks

The intrinsic dimension (ID) of data representations, i.e. the minimal number of parameters needed to describe a representation, is studied, and it is found that in a trained network the ID is orders of magnitude smaller than the number of units in each layer.
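That line of work estimates the ID with a two-nearest-neighbours (TwoNN-style) estimator, which can be sketched in a few lines of NumPy. This is an illustrative sketch assuming Euclidean distances, not the authors' code:

```python
import numpy as np

def twonn_id(X):
    """TwoNN-style intrinsic-dimension estimate.

    For each point, take the distances r1, r2 to its first and second
    nearest neighbours. If the data has intrinsic dimension d, the
    ratio mu = r2/r1 satisfies P(mu > x) = x**(-d), which yields the
    maximum-likelihood estimate d = N / sum(log mu).
    """
    sq = (X ** 2).sum(axis=1)
    d2 = np.maximum(sq[:, None] + sq[None, :] - 2 * X @ X.T, 0.0)
    np.fill_diagonal(d2, np.inf)           # exclude self-distances
    d2.sort(axis=1)                        # d2[:, 0] = r1^2, d2[:, 1] = r2^2
    mu = np.sqrt(d2[:, 1] / d2[:, 0])      # ratio of 2nd to 1st NN distance
    return len(X) / np.log(mu).sum()

rng = np.random.default_rng(0)
# 1000 points on a 3-dimensional linear subspace of a 10-d ambient space
X = rng.standard_normal((1000, 3)) @ rng.standard_normal((3, 10))
print(twonn_id(X))  # close to 3, far below the ambient dimension 10
```

The estimate recovers the dimension of the subspace the data lies on, not the number of coordinates used to store it, which is the sense in which a layer's ID can be orders of magnitude smaller than its width.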

A Closer Look at Memorization in Deep Networks

The analysis suggests that dataset-independent notions of effective capacity are unlikely to explain the generalization performance of deep networks trained with gradient-based methods, because the training data itself plays an important role in determining the degree of memorization.

Scaling description of generalization with number of parameters in deep learning

This work relies on the so-called Neural Tangent Kernel, which connects large neural nets to kernel methods, to show that the initialization causes finite-size random fluctuations of the neural network output function f_N around its expectation, which affect the generalization error for classification.

Implicit Bias of Gradient Descent for Wide Two-layer Neural Networks Trained with the Logistic Loss

It is shown that the limits of the gradient flow on exponentially tailed losses can be fully characterized as a max-margin classifier in a certain non-Hilbertian space of functions.

The Implicit Bias of Gradient Descent on Separable Data

We examine gradient descent on unregularized logistic regression problems, with homogeneous linear predictors on linearly separable datasets. We show the predictor converges to the direction of the max-margin (hard-margin SVM) solution.
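The phenomenon is easy to reproduce numerically. A minimal NumPy sketch (a hypothetical setup, not from the paper): full-batch gradient descent on the unregularized logistic loss over separable 2-D data in which only the first coordinate is informative, so the max-margin direction is close to (1, 0).

```python
import numpy as np

rng = np.random.default_rng(0)

# Linearly separable data: the label is the sign of the first coordinate,
# with a margin of 0.3 enforced along that axis.
X = rng.standard_normal((400, 2))
X = X[np.abs(X[:, 0]) > 0.3]
y = np.sign(X[:, 0])

w = np.zeros(2)
lr = 0.1
norms = []
for t in range(20000):
    m = y * (X @ w)                                    # per-sample margins
    # gradient of mean(log(1 + exp(-m))) with respect to w
    g = -(X * (y / (1 + np.exp(m)))[:, None]).mean(axis=0)
    w -= lr * g
    norms.append(np.linalg.norm(w))

# The norm of w keeps growing (the loss never reaches zero), while the
# direction stabilizes toward the max-margin separator.
direction = w / np.linalg.norm(w)
print(direction)  # dominated by the informative first coordinate
```

The convergence of the direction is logarithmically slow, which is why the sketch runs many iterations even on a toy problem.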

Improving neural networks by preventing co-adaptation of feature detectors

When a large feedforward neural network is trained on a small training set, it typically performs poorly on held-out test data. This "overfitting" is greatly reduced by randomly omitting half of the feature detectors on each training case.
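The mechanism translates to a few lines of NumPy. This is an "inverted dropout" sketch, a common modern variant rather than the paper's original formulation:

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout(h, p=0.5, train=True):
    """Zero each unit independently with probability p during training.

    Inverted dropout scales the surviving units by 1/(1-p), so the
    expected activation is unchanged and no rescaling is needed at
    test time.
    """
    if not train:
        return h
    mask = rng.random(h.shape) >= p    # keep each unit with probability 1-p
    return h * mask / (1.0 - p)

h = np.ones((4, 8))                    # a batch of hidden activations
h_train = dropout(h, p=0.5)            # roughly half the units zeroed, the rest doubled
h_test = dropout(h, train=False)       # identity at test time
```

Because each unit must remain useful when any subset of its neighbours is absent, the network is discouraged from relying on brittle co-adaptations between feature detectors.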

Wide Residual Networks

This paper conducts a detailed experimental study on the architecture of ResNet blocks and proposes a novel architecture in which the depth of residual networks is decreased and their width increased; the resulting structures, called wide residual networks (WRNs), are far superior to their commonly used thin and very deep counterparts.

On Connectivity of Solutions in Deep Learning: The Role of Over-parameterization and Feature Quality

This paper presents a novel condition for ensuring the connectivity of two arbitrary points in parameter space and shows that if subsets of features at each layer are linearly separable, then almost no over-parameterization is needed.

Neural Networks and the Bias/Variance Dilemma

It is suggested that current-generation feedforward neural networks are largely inadequate for difficult problems in machine perception and machine learning, regardless of parallel-versus-serial hardware or other implementation issues.