Corpus ID: 195874215

Towards Explaining the Regularization Effect of Initial Large Learning Rate in Training Neural Networks

@article{Li2019TowardsET,
  title={Towards Explaining the Regularization Effect of Initial Large Learning Rate in Training Neural Networks},
  author={Y. Li and Colin Wei and Tengyu Ma},
  journal={ArXiv},
  year={2019},
  volume={abs/1907.04595}
}
  • Fields of study: Computer Science, Mathematics
  • Abstract: Stochastic gradient descent with a large initial learning rate is a widely adopted method for training modern neural net architectures. Although a small initial learning rate allows for faster training and better test performance initially, the large learning rate achieves better generalization soon after the learning rate is annealed. Towards explaining this phenomenon, we devise a setting in which we can prove that a two-layer network trained with large initial learning rate and annealing…
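
As a rough illustration of the training regime described in the abstract above, the sketch below contrasts a small constant learning rate with a large initial learning rate that is annealed partway through training. This is a minimal PyTorch sketch under illustrative assumptions, not the authors' code: the toy two-layer model, the synthetic data, and the hyperparameters (0.01 vs. 0.2, a 10x decay at epoch 20 of 30) are all hypothetical.

import torch
import torch.nn as nn
from torch.optim.lr_scheduler import MultiStepLR

def train(lr, milestones, gamma, epochs=30, seed=0):
    """Full-batch gradient descent on a toy two-layer ReLU network.

    `milestones` and `gamma` define the annealing schedule: the learning
    rate is multiplied by `gamma` at each epoch listed in `milestones`.
    """
    torch.manual_seed(seed)
    model = nn.Sequential(nn.Linear(20, 128), nn.ReLU(), nn.Linear(128, 2))
    X, y = torch.randn(512, 20), torch.randint(0, 2, (512,))  # synthetic data
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    sched = MultiStepLR(opt, milestones=milestones, gamma=gamma)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(epochs):
        opt.zero_grad()
        loss = loss_fn(model(X), y)
        loss.backward()
        opt.step()
        sched.step()  # decays the learning rate at the chosen milestones
    return loss.item()

# Small constant learning rate: no annealing.
small_lr_loss = train(lr=0.01, milestones=[], gamma=1.0)
# Large initial learning rate, decayed by 10x two thirds of the way through.
annealed_loss = train(lr=0.2, milestones=[20], gamma=0.1)
print(f"small constant LR, final training loss:    {small_lr_loss:.4f}")
print(f"large LR then annealing, final train loss: {annealed_loss:.4f}")

Note that the paper's claim concerns test (generalization) performance on real data, which this toy script does not measure; the sketch only shows how a large-initial-learning-rate schedule with annealing is typically set up.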

    Citations

    How Does Learning Rate Decay Help Modern Neural Networks (6 citations)
    The large learning rate phase of deep learning: the catapult mechanism (9 citations)
    Extrapolation for Large-batch Training in Deep Learning
    Learning Rate Annealing Can Provably Help Generalization, Even for Convex Problems (1 citation)
    On the Generalization Benefit of Noise in Stochastic Gradient Descent (1 citation)
    Taylorized Training: Towards Better Approximation of Neural Network Training at Finite Width (4 citations)
    Deep Learning Requires Explicit Regularization for Reliable Predictive Probability
    On Learning Rates and Schrödinger Operators (5 citations)

    References

    Publications referenced by this paper (showing 1-10 of 44 references):
    Train longer, generalize better: closing the generalization gap in large batch training of neural networks (291 citations)
    Cyclical Learning Rates for Training Neural Networks (576 citations)
    SGD on Neural Networks Learns Functions of Increasing Complexity (24 citations)
    On the Convergence Rate of Training Recurrent Neural Networks (54 citations)
    Don't Decay the Learning Rate, Increase the Batch Size (350 citations)
    A Bayesian Perspective on Generalization and Stochastic Gradient Descent (143 citations)
    Closing the Generalization Gap of Adaptive Gradient Methods in Training Deep Neural Networks (49 citations)
    Gradient Descent Provably Optimizes Over-parameterized Neural Networks (374 citations)
    On Large-Batch Training for Deep Learning: Generalization Gap and Sharp Minima (951 citations)
    Fine-Grained Analysis of Optimization and Generalization for Overparameterized Two-Layer Neural Networks (222 citations)