Corpus ID: 244909075

Multi-scale Feature Learning Dynamics: Insights for Double Descent

@inproceedings{Pezeshki2021MultiscaleFL,
  title={Multi-scale Feature Learning Dynamics: Insights for Double Descent},
  author={Mohammad Pezeshki and Amartya Mitra and Yoshua Bengio and Guillaume Lajoie},
  booktitle={International Conference on Machine Learning},
  year={2021}
}
An intriguing phenomenon arising from the high-dimensional learning dynamics of neural networks is “double descent”. The more commonly studied aspect of this phenomenon is model-wise double descent, where the test error exhibits a second descent with increasing model complexity, beyond the classical U-shaped error curve. In this work, we investigate the origins of the less studied epoch-wise double descent, in which the test error undergoes two non-monotonous…
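The multi-scale mechanism sketched in the abstract can be probed with a toy experiment. Below is a minimal sketch, not the paper's exact setup: full-batch gradient descent on an overparameterized linear regression whose features live at two scales, so fast-scale directions are fit early and slow-scale directions much later; with enough scale separation and label noise, the test error traced over epochs can descend, rise, and descend again. All dimensions, scales, the noise level, and the learning rate are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 80, 100                      # overparameterized: more features than samples
scales = np.concatenate([np.full(d // 2, 1.0),    # fast-learned features
                         np.full(d // 2, 0.1)])   # slow-learned features

X_tr = rng.standard_normal((n, d)) * scales
X_te = rng.standard_normal((2000, d)) * scales
w_star = rng.standard_normal(d) / np.sqrt(d)         # teacher weights
y_tr = X_tr @ w_star + 0.5 * rng.standard_normal(n)  # noisy training labels
y_te = X_te @ w_star                                 # clean test labels

w, lr = np.zeros(d), 5e-3
for epoch in range(1, 30001):
    w -= lr * X_tr.T @ (X_tr @ w - y_tr) / n         # full-batch gradient step
    if epoch % 3000 == 0:
        print(epoch, round(np.mean((X_te @ w - y_te) ** 2), 4))
```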

Citations

An Empirical Deep Dive into Deep Learning's Driving Dynamics

We present an empirical dataset surveying the deep learning phenomenon on fully-connected networks, encompassing the training and test performance of numerous network topologies, sweeping across…

Epoch-Wise Double Descent Triggered by Learning a Single Sample

Fully-connected networks (FCNs) are investigated, and it is found empirically that optimizers play an important role in the memorization process; splitting the model into the output layer's bias and the remaining parameters explains why early stopping contributes to better generalization.

Over-Training with Mixup May Hurt Generalization

This work reports a previously unobserved phenomenon in Mixup training: on a number of standard datasets, the performance of Mixup-trained models starts to decay after training for a large number of epochs, giving rise to a U-shaped generalization curve.
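For context, Mixup (Zhang et al., 2018) trains on convex combinations of pairs of examples and their labels. A minimal sketch of the augmentation step; the default alpha is illustrative.

```python
import numpy as np

def mixup(x1, y1, x2, y2, alpha=1.0, rng=None):
    """Return a convex combination of two (input, label) pairs."""
    rng = rng or np.random.default_rng()
    lam = rng.beta(alpha, alpha)  # mixing coefficient drawn from Beta(alpha, alpha)
    return lam * x1 + (1 - lam) * x2, lam * y1 + (1 - lam) * y2
```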

Unifying Grokking and Double Descent

A principled understanding of generalization in deep learning requires unifying disparate observations under a single conceptual framework. Previous work has studied grokking, a training dynamic in…

Similarity and Generalization: From Noise to Corruption

Probing the equivalence between online optimization and offline generalization in SNNs, this work shows that the correspondence breaks down in the presence of label noise in all scenarios considered, a phenomenon termed Density-Induced Break of Similarity (DIBS).

Towards Understanding Grokking: An Effective Theory of Representation Learning

This study not only provides intuitive explanations of the origin of grokking but also highlights the usefulness of physics-inspired tools, e.g., effective theories and phase diagrams, for understanding deep learning.

Grokking phase transitions in learning local rules with gradient descent

A tensor-network map is introduced that connects the proposed grokking setup with the standard (perceptron) statistical learning theory; it is shown that grokking is a consequence of the locality of the teacher model, and the critical exponent and the grokking-time distributions are determined numerically.

References

Showing 1–10 of 73 references

Early Stopping in Deep Networks: Double Descent and How to Eliminate it

Inspired by this theory, two standard convolutional networks are studied empirically, and it is shown that eliminating epoch-wise double descent by adjusting the step sizes of different layers significantly improves early-stopping performance.
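The layer-wise step-size adjustment this reference studies can be expressed with optimizer parameter groups. A minimal PyTorch sketch; the architecture, the layer split, and the 10x learning-rate gap are illustrative assumptions rather than the reference's tuned values.

```python
import torch.nn as nn
import torch.optim as optim

# Toy network for 32x32 RGB inputs; the layers chosen here are placeholders.
model = nn.Sequential(
    nn.Conv2d(3, 16, 3), nn.ReLU(), nn.Flatten(),
    nn.Linear(16 * 30 * 30, 10),
)

# One parameter group per layer lets each layer descend at its own rate.
optimizer = optim.SGD(
    [
        {"params": model[0].parameters(), "lr": 0.1},   # early (conv) layer
        {"params": model[3].parameters(), "lr": 0.01},  # output layer
    ],
    lr=0.01,          # default for any group without an explicit lr
    momentum=0.9,
)
```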

When and how epochwise double descent happens

This work develops an analytically tractable model of epochwise double descent that allows us to characterise theoretically when this effect is likely to occur and shows experimentally that deep neural networks behave similarly to the theoretical model.

High-dimensional dynamics of generalization error in neural networks

Double Trouble in Double Descent: Bias and Variance(s) in the Lazy Regime

A quantitative theory for the double descent of test error in the so-called lazy learning regime of neural networks is developed by considering the problem of learning a high-dimensional function with random features regression, and it is shown that the bias displays a phase transition at the interpolation threshold, beyond which it remains constant.
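The random-features setting in this reference is easy to mimic numerically: minimum-norm least squares on fixed ReLU random features typically shows a test-error peak near the interpolation threshold, where the number of features matches the number of samples. A minimal sketch; the dimensions, noise level, and feature counts are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 30, 100                                    # input dim, training samples
X_tr = rng.standard_normal((n, d))
X_te = rng.standard_normal((2000, d))
w = rng.standard_normal(d) / np.sqrt(d)           # teacher weights
y_tr = X_tr @ w + 0.3 * rng.standard_normal(n)    # noisy labels
y_te = X_te @ w

for P in (20, 50, 90, 100, 110, 200, 500):        # number of random features
    F = rng.standard_normal((d, P)) / np.sqrt(d)  # fixed ("lazy") first layer
    features = lambda X: np.maximum(X @ F, 0.0)   # ReLU random features
    a, *_ = np.linalg.lstsq(features(X_tr), y_tr, rcond=None)  # min-norm fit
    print(P, round(np.mean((features(X_te) @ a - y_te) ** 2), 3))
```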

Deep double descent: where bigger models and more data hurt

A new complexity measure, the effective model complexity, is defined, and a generalized double descent with respect to this measure is conjectured; this notion makes it possible to identify regimes where increasing the number of training samples actually hurts test performance.

Generalization of Two-layer Neural Networks: An Asymptotic Viewpoint

The exact population risk is derived for unregularized least-squares regression with two-layer neural networks, when either the first or the second layer is trained via gradient flow under different initialization setups.

Scaling description of generalization with number of parameters in deep learning

This work relies on the so-called Neural Tangent Kernel, which connects large neural nets to kernel methods, to show that the initialization causes finite-size random fluctuations of the neural-net output function f_N around its expectation, which affect the generalization error for classification.
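The fluctuation this summary describes can be written as a decomposition of the width-N network output over random initializations θ₀; a sketch in LaTeX, where only the structure (not any precise scaling exponent) is taken as given:

```latex
f_N(x) = \bar{f}_N(x) + \delta f_N(x), \qquad
\bar{f}_N(x) \equiv \mathbb{E}_{\theta_0}\!\left[ f_N(x) \right], \qquad
\mathbb{E}_{\theta_0}\!\left[ \delta f_N(x) \right] = 0 .
```

The variance of the fluctuation term δf_N shrinks as N grows, which is why averaging predictors over several initializations should reduce the fluctuation-induced part of the classification error.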

Exact solutions to the nonlinear dynamics of learning in deep linear neural networks

It is shown that deep linear networks exhibit nonlinear learning phenomena similar to those seen in simulations of nonlinear networks, including long plateaus followed by rapid transitions to lower error solutions, and faster convergence from greedy unsupervised pretraining initial conditions than from random initial conditions.
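The plateau-then-transition dynamics can be reproduced in a few lines: a two-layer linear network fitting a target map with well-separated singular values learns each mode on its own timescale, so the loss drops in steps. A minimal sketch; the dimensions, singular values, learning rate, and initialization scale are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
d = 20
U, _ = np.linalg.qr(rng.standard_normal((d, d)))
V, _ = np.linalg.qr(rng.standard_normal((d, d)))
s = np.zeros(d)
s[0], s[1] = 3.0, 0.3                       # two well-separated singular values
W_star = U @ np.diag(s) @ V.T               # target linear map

W1 = 1e-3 * rng.standard_normal((d, d))     # small random initialization
W2 = 1e-3 * rng.standard_normal((d, d))
lr = 0.05
for t in range(4001):
    E = W2 @ W1 - W_star                    # residual of the composed map
    g2, g1 = E @ W1.T, W2.T @ E             # gradients of 0.5 * ||E||_F^2
    W2 -= lr * g2
    W1 -= lr * g1
    if t % 400 == 0:
        print(t, round(0.5 * np.linalg.norm(E) ** 2, 5))
```

With a small initialization, the loss sits on a plateau until the fast mode (s = 3.0) escapes, then plateaus again until the slow mode (s = 0.3) is learned.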

Implicit Bias of Gradient Descent for Wide Two-layer Neural Networks Trained with the Logistic Loss

It is shown that the limits of the gradient flow on exponentially tailed losses can be fully characterized as a max-margin classifier in a certain non-Hilbertian space of functions.

Triple descent and the two kinds of overfitting: where and why do they appear?

It is shown that the nonlinear peak at N = P is a true divergence caused by the extreme sensitivity of the output function to both the noise corrupting the labels and the initialization of the random features (or the weights in NNs).
...