# Early Stopping in Deep Networks: Double Descent and How to Eliminate it

@article{Heckel2021EarlySI,
  title={Early Stopping in Deep Networks: Double Descent and How to Eliminate it},
  author={Reinhard Heckel and Fatih Yilmaz},
  journal={ArXiv},
  year={2021},
  volume={abs/2007.10099}
}

Over-parameterized models, such as large deep networks, often exhibit a double descent phenomenon, where, as a function of model size, the test error first decreases, then increases, and then decreases again. This intriguing double descent behavior also occurs as a function of training epochs and has been conjectured to arise because training epochs control the model complexity. In this paper, we show that such epoch-wise double descent arises for a different reason: it is caused by a superposition of two or more…
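The superposition mechanism can be illustrated numerically. In the sketch below, the curve shapes, time constants, and amplitudes are my own illustrative assumptions, not the paper's model: two error components that evolve on different time scales each behave simply on their own, yet their sum descends, rises, and descends again as a function of the epoch.

```python
import numpy as np

# Illustrative sketch (curve shapes and constants are assumptions, not the
# paper's model): two error components evolving on different time scales.
epochs = np.arange(1, 1001)

# Fast component: its error drops quickly, then overfitting raises it
# toward a plateau.
fast = np.exp(-epochs / 5) + 0.6 * (1 - np.exp(-epochs / 10))

# Slow component: its error decays over a much longer horizon.
slow = 0.5 * np.exp(-epochs / 200)

total = fast + slow  # descends, rises, then descends again

i_min = int(np.argmin(total[:25]))   # first minimum, in the early epochs
bump = float(total[i_min:60].max())  # the intermediate rise after it
```

Neither component alone is double-descending; the non-monotonicity of `total` comes purely from the two time scales, which is the shape of the argument in the paper.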

#### 5 Citations

Development and prospective validation of COVID-19 chest X-ray screening model for patients attending emergency departments

- Medicine
- Scientific reports
- 2021

An AI algorithm, CovIx, is developed to differentiate normal, abnormal, non-COVID-19 pneumonia, and COVID-19 CXRs using a multicentre cohort of 293,143 CXRs; it performs on par with four board-certified radiologists.

Disparity Between Batches as a Signal for Early Stopping

- Computer Science
- ECML/PKDD
- 2021

We propose a metric for evaluating the generalization ability of deep neural networks trained with mini-batch gradient descent. Our metric, called gradient disparity, is the $\ell_2$ norm distance…
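The gradient-disparity idea can be sketched on a toy model (the data, loss, and batch size below are my own assumptions, not the paper's experimental setup): compute gradients on two independently drawn mini-batches at the same parameters and take the $\ell_2$ distance between them.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy least-squares model (data, batch size, and loss are assumptions,
# not the paper's setup).
n, d = 200, 10
X = rng.normal(size=(n, d))
y = X @ rng.normal(size=d) + 0.1 * rng.normal(size=n)
w = np.zeros(d)  # current parameters, e.g. mid-training

def batch_grad(idx):
    """Mean squared-error gradient over the mini-batch indexed by idx."""
    Xb, yb = X[idx], y[idx]
    return 2.0 * Xb.T @ (Xb @ w - yb) / len(idx)

# Gradient disparity: l2 distance between gradients of two independent
# mini-batches at the same parameters; large values mean the batches
# "disagree" about the update direction.
b1 = rng.choice(n, size=32, replace=False)
b2 = rng.choice(n, size=32, replace=False)
disparity = float(np.linalg.norm(batch_grad(b1) - batch_grad(b2)))
```

Tracking this scalar over training, and stopping when it starts to grow, is the early-stopping signal the paper proposes; no held-out validation set is needed.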

Optimization Variance: Exploring Generalization Properties of DNNs

- Computer Science
- ArXiv
- 2021

A novel metric, optimization variance (OV), is proposed, to measure the diversity of model updates caused by the stochastic gradients of random training batches drawn in the same iteration, and hence early stopping may be achieved without using a validation set.
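A minimal sketch of this kind of statistic, under my own assumptions (a toy linear model and a simple variance summary, not the paper's exact definition): draw several mini-batches at one iteration and measure how much their gradients disagree.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy setup (model, data, and batch count are assumptions, not the
# paper's full definition of OV).
n, d, K = 500, 8, 20
X = rng.normal(size=(n, d))
y = X @ rng.normal(size=d) + 0.1 * rng.normal(size=n)
w = rng.normal(size=d)  # fixed current parameters

def batch_grad(idx):
    """Mean squared-error gradient over the mini-batch indexed by idx."""
    Xb, yb = X[idx], y[idx]
    return 2.0 * Xb.T @ (Xb @ w - yb) / len(idx)

# K stochastic gradients drawn at the same iteration; summarize their
# disagreement as the total variance across batches.
grads = np.stack([batch_grad(rng.choice(n, 32, replace=False))
                  for _ in range(K)])
opt_var = float(grads.var(axis=0).sum())
```

The appeal, as with gradient disparity, is that this quantity is computed entirely from training batches, so early stopping can be triggered without a validation set.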

When and how epochwise double descent happens

- Computer Science
- ArXiv
- 2021

This work develops an analytically tractable model of epochwise double descent that allows us to characterise theoretically when this effect is likely to occur and shows experimentally that deep neural networks behave similarly to the theoretical model.

Optimization Variance: Exploring Generalization Properties of DNNs

- 2020

Unlike the conventional wisdom in statistical learning theory, the test error of a deep neural network (DNN) often demonstrates double descent: as the model complexity increases, it first follows a…

#### References

Showing 1–10 of 34 references

Double Trouble in Double Descent: Bias and Variance(s) in the Lazy Regime

- Mathematics, Computer Science
- ICML
- 2020

A quantitative theory for the double descent of test error in the so-called lazy learning regime of neural networks is developed by considering the problem of learning a high-dimensional function with random features regression, and it is shown that the bias displays a phase transition at the interpolation threshold, beyond which it remains constant.

Gradient Descent Provably Optimizes Over-parameterized Neural Networks

- Computer Science, Mathematics
- ICLR
- 2019

Over-parameterization and random initialization jointly restrict every weight vector to remain close to its initialization for all iterations, which yields a strong convexity-like property showing that gradient descent converges at a global linear rate to a global optimum.

The Generalization Error of Random Features Regression: Precise Asymptotics and the Double Descent Curve

- Mathematics
- 2019

Deep learning methods operate in regimes that defy the traditional statistical mindset. The neural network architectures often contain more parameters than training samples, and are so rich that they…

Deep Double Descent: Where Bigger Models and More Data Hurt

- Computer Science, Mathematics
- ICLR
- 2020

A new complexity measure, called the effective model complexity, is defined, and a generalized double descent is conjectured with respect to this measure; the notion identifies certain regimes where increasing the number of training samples actually hurts test performance.

Optimal Regularization Can Mitigate Double Descent

- Computer Science, Mathematics
- ICLR
- 2021

This work proves that for certain linear regression models with an isotropic data distribution, optimally-tuned $\ell_2$ regularization achieves monotonic test performance as either the sample size or the model size grows, and demonstrates empirically that optimally-tuned regularization can mitigate double descent for more general models, including neural networks.

Gradient Descent with Early Stopping is Provably Robust to Label Noise for Overparameterized Neural Networks

- Computer Science, Mathematics
- AISTATS
- 2020

Under a rich dataset model, it is shown that gradient descent is provably robust to noise/corruption on a constant fraction of the labels despite overparameterization; the results shed light on the empirical robustness of deep networks as well as on commonly adopted heuristics to prevent overfitting.

Fine-Grained Analysis of Optimization and Generalization for Overparameterized Two-Layer Neural Networks

- Computer Science, Mathematics
- ICML
- 2019

This paper analyzes training and generalization for a simple two-layer ReLU net with random initialization, and provides the following improvements over recent works: a tighter characterization of training speed, an explanation for why training a neural net with random labels leads to slower training, and a data-dependent complexity measure.

Toward Moderate Overparameterization: Global Convergence Guarantees for Training Shallow Neural Networks

- Computer Science, Mathematics
- IEEE Journal on Selected Areas in Information Theory
- 2020

Focusing on shallow neural nets and smooth activations, it is shown that (stochastic) gradient descent, when initialized at random, converges at a geometric rate to a nearby global optimum as soon as the square root of the number of network parameters exceeds the size of the training data.

High-dimensional dynamics of generalization error in neural networks

- Computer Science, Mathematics
- Neural Networks
- 2020

It is found that the dynamics of gradient descent learning naturally protect against overtraining and overfitting in large networks, and that standard applications of theories such as Rademacher complexity are inaccurate in predicting the generalization performance of deep neural networks.

Generalization Guarantees for Neural Networks via Harnessing the Low-rank Structure of the Jacobian

- Computer Science, Mathematics
- ArXiv
- 2019

A data-dependent optimization and generalization theory is developed which leverages the low-rank structure of the Jacobian matrix associated with the network, showing that even constant-width neural nets can provably generalize for sufficiently nice datasets.