# Towards Explaining the Regularization Effect of Initial Large Learning Rate in Training Neural Networks

```bibtex
@article{Li2019TowardsET,
  title   = {Towards Explaining the Regularization Effect of Initial Large Learning Rate in Training Neural Networks},
  author  = {Y. Li and Colin Wei and Tengyu Ma},
  journal = {ArXiv},
  year    = {2019},
  volume  = {abs/1907.04595}
}
```

Stochastic gradient descent with a large initial learning rate is a widely adopted method for training modern neural net architectures. Although a small initial learning rate allows for faster training and better test performance initially, the large learning rate achieves better generalization soon after the learning rate is annealed. Towards explaining this phenomenon, we devise a setting in which we can prove that a two layer network trained with large initial learning rate and annealing…
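The schedule the abstract describes can be sketched in a few lines: run SGD with a large learning rate first, then anneal to a small one. The toy 1-D quadratic loss, the noise scale, and the decay point below are illustrative assumptions for the sketch, not the paper's two-layer-network setting.

```python
import random

def sgd_with_annealing(steps=200, lr_large=0.5, lr_small=0.05, anneal_at=100):
    """Noisy SGD on the 1-D quadratic loss f(w) = w**2 / 2.

    Uses the large learning rate for the first `anneal_at` steps,
    then switches to the small one, mirroring the large-then-annealed
    schedule discussed in the abstract. Returns the final iterate.
    """
    random.seed(0)
    w = 5.0  # illustrative starting point, far from the minimum at 0
    for t in range(steps):
        lr = lr_large if t < anneal_at else lr_small
        grad = w + random.gauss(0.0, 0.1)  # stochastic gradient of w**2/2
        w -= lr * grad
    return w

final_w = sgd_with_annealing()
```

After annealing, the small learning rate shrinks the noise-driven fluctuations around the minimum, so the final iterate lands close to zero; the point of the paper is that *when* this switch happens affects generalization, not just final training loss.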

#### 36 Citations

- The large learning rate phase of deep learning: the catapult mechanism (Mathematics, Computer Science, 2020; 9 citations)
- Learning Rate Annealing Can Provably Help Generalization, Even for Convex Problems (Mathematics, Computer Science, 2020; 1 citation)
- On the Generalization Benefit of Noise in Stochastic Gradient Descent (Computer Science, Mathematics, 2020; 1 citation)
- Taylorized Training: Towards Better Approximation of Neural Network Training at Finite Width (Mathematics, Computer Science, 2020; 4 citations)
- Deep Learning Requires Explicit Regularization for Reliable Predictive Probability (Computer Science, Mathematics, 2020)
- Backward Feature Correction: How Deep Learning Performs Deep Learning (Computer Science, Mathematics, 2020; 11 citations)
- The Surprising Simplicity of the Early-Time Learning Dynamics of Neural Networks (Computer Science, Mathematics, 2020)

#### References

Showing 1-10 of 44 references.

- Train longer, generalize better: closing the generalization gap in large batch training of neural networks (Mathematics, Computer Science, 2017; 291 citations)
- A Bayesian Perspective on Generalization and Stochastic Gradient Descent (Computer Science, Mathematics, 2018; 143 citations)
- Closing the Generalization Gap of Adaptive Gradient Methods in Training Deep Neural Networks (Computer Science, Mathematics, 2018; 49 citations)
- Gradient Descent Provably Optimizes Over-parameterized Neural Networks (Mathematics, Computer Science, 2019; 374 citations)
- On Large-Batch Training for Deep Learning: Generalization Gap and Sharp Minima (Computer Science, Mathematics, 2017; 951 citations)
- Fine-Grained Analysis of Optimization and Generalization for Overparameterized Two-Layer Neural Networks (Mathematics, Computer Science, 2019; 222 citations)