# Towards Explaining the Regularization Effect of Initial Large Learning Rate in Training Neural Networks

@article{Li2019TowardsET,
  title={Towards Explaining the Regularization Effect of Initial Large Learning Rate in Training Neural Networks},
  author={Y. Li and Colin Wei and Tengyu Ma},
  journal={ArXiv},
  year={2019},
  volume={abs/1907.04595}
}

Stochastic gradient descent with a large initial learning rate is a widely adopted method for training modern neural net architectures. Although a small initial learning rate allows for faster training and better test performance initially, the large learning rate achieves better generalization soon after the learning rate is annealed. Towards explaining this phenomenon, we devise a setting in which we can prove that a two layer network trained with large initial learning rate and annealing…

