Corpus ID: 220647137

Early Stopping in Deep Networks: Double Descent and How to Eliminate it

Reinhard Heckel and Fatih Yilmaz
Over-parameterized models, such as large deep networks, often exhibit a double descent phenomenon, where, as a function of model size, the error first decreases, then increases, and finally decreases again. This intriguing double descent behavior also occurs as a function of training epochs and has been conjectured to arise because training epochs control the model complexity. In this paper, we show that such epoch-wise double descent arises for a different reason: it is caused by a superposition of two or more…

Development and prospective validation of COVID-19 chest X-ray screening model for patients attending emergency departments
An AI algorithm, CovIx, is developed to differentiate normal, abnormal, non-COVID-19 pneumonia, and COVID-19 CXRs using a multicentre cohort of 293,143 CXRs, and it performs on par with four board-certified radiologists.
Disparity Between Batches as a Signal for Early Stopping
We propose a metric for evaluating the generalization ability of deep neural networks trained with mini-batch gradient descent. Our metric, called gradient disparity, is the $\ell_2$ norm distance…
Optimization Variance: Exploring Generalization Properties of DNNs
A novel metric, optimization variance (OV), is proposed to measure the diversity of model updates caused by the stochastic gradients of random training batches drawn in the same iteration; with it, early stopping may be achieved without using a validation set.
When and how epochwise double descent happens
This work develops an analytically tractable model of epochwise double descent that allows us to characterise theoretically when this effect is likely to occur and shows experimentally that deep neural networks behave similarly to the theoretical model.
  • 2020
Unlike the conventional wisdom in statistical learning theory, the test error of a deep neural network (DNN) often demonstrates double descent: as the model complexity increases, it first follows a…


Double Trouble in Double Descent : Bias and Variance(s) in the Lazy Regime
A quantitative theory for the double descent of test error in the so-called lazy learning regime of neural networks is developed by considering the problem of learning a high-dimensional function with random features regression, and it is shown that the bias displays a phase transition at the interpolation threshold, beyond which it remains constant.
Gradient Descent Provably Optimizes Over-parameterized Neural Networks
Over-parameterization and random initialization jointly restrict every weight vector to be close to its initialization for all iterations, which allows a strong convexity-like property to show that gradient descent converges at a global linear rate to the global optimum.
The Generalization Error of Random Features Regression: Precise Asymptotics and the Double Descent Curve
Deep learning methods operate in regimes that defy the traditional statistical mindset. The neural network architectures often contain more parameters than training samples, and are so rich that they…
Deep Double Descent: Where Bigger Models and More Data Hurt
The notion of model complexity allows us to identify certain regimes where increasing the number of training samples actually hurts test performance. The work defines a new complexity measure, called the effective model complexity, and conjectures a generalized double descent with respect to this measure.
Optimal Regularization Can Mitigate Double Descent
This work proves that, for certain linear regression models with isotropic data distribution, optimally-tuned $\ell_2$ regularization achieves monotonic test performance as either the sample size or the model size grows, and demonstrates empirically that optimally-tuned regularization can mitigate double descent for more general models, including neural networks.
Gradient Descent with Early Stopping is Provably Robust to Label Noise for Overparameterized Neural Networks
Under a rich dataset model, it is shown that gradient descent is provably robust to noise/corruption on a constant fraction of the labels despite overparameterization, shedding light on the empirical robustness of deep networks as well as commonly adopted heuristics to prevent overfitting.
Fine-Grained Analysis of Optimization and Generalization for Overparameterized Two-Layer Neural Networks
This paper analyzes training and generalization for a simple 2-layer ReLU net with random initialization, and provides the following improvements over recent works: a tighter characterization of training speed, an explanation for why training a neural net with random labels leads to slower training, and a data-dependent complexity measure.
Toward Moderate Overparameterization: Global Convergence Guarantees for Training Shallow Neural Networks
Focusing on shallow neural nets and smooth activations, it is shown that (stochastic) gradient descent, when initialized at random, converges at a geometric rate to a nearby global optimum as soon as the square root of the number of network parameters exceeds the size of the training data.
High-dimensional dynamics of generalization error in neural networks
It is found that the dynamics of gradient descent learning naturally protect against overtraining and overfitting in large networks, and that standard application of theories such as Rademacher complexity is inaccurate in predicting the generalization performance of deep neural networks.
Generalization Guarantees for Neural Networks via Harnessing the Low-rank Structure of the Jacobian
A data-dependent optimization and generalization theory is developed that leverages the low-rank structure of the Jacobian matrix associated with the network and shows that even constant-width neural nets can provably generalize for sufficiently nice datasets.