Corpus ID: 209323795

Why Gradient Clipping Accelerates Training: A Theoretical Justification for Adaptivity

@article{Zhang2020WhyGC,
  title={Why Gradient Clipping Accelerates Training: A Theoretical Justification for Adaptivity},
  author={Jingzhao Zhang and Tianxing He and Suvrit Sra and Ali Jadbabaie},
  journal={arXiv: Optimization and Control},
  year={2020}
}
We provide a theoretical explanation for the effectiveness of gradient clipping in training deep neural networks. The key ingredient is a new smoothness condition derived from practical neural network training examples. We observe that gradient smoothness, a concept central to the analysis of first-order optimization algorithms that is often assumed to be a constant, demonstrates significant variability along the training trajectory of deep neural networks. Further, this smoothness positively…
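The abstract concerns clipped gradient descent, whose adaptive step size the paper analyzes under the relaxed smoothness condition. A minimal sketch of one clip-by-norm update follows; the function and parameter names (`clipped_gradient_step`, `clip_threshold`) are illustrative, not from the paper:

```python
import numpy as np

def clipped_gradient_step(params, grad, lr=0.1, clip_threshold=1.0):
    """One clipped-GD update: if the gradient norm exceeds the threshold,
    rescale the gradient so its norm equals the threshold, then take a
    plain gradient step. The effective step size is lr * min(1, threshold / ||grad||).

    Illustrative sketch only; names and defaults are assumptions.
    """
    norm = np.linalg.norm(grad)
    if norm > clip_threshold:
        grad = grad * (clip_threshold / norm)  # rescale to norm == clip_threshold
    return params - lr * grad
```

For example, a gradient of norm 5 with `clip_threshold=1.0` is rescaled by 1/5 before the step, so the step length stays bounded by `lr * clip_threshold` regardless of how large the raw gradient is.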
