Corpus ID: 53387011

On the Relation Between the Sharpest Directions of DNN Loss and the SGD Step Length

@article{Jastrzebski2019OnTR,
  title={On the Relation Between the Sharpest Directions of DNN Loss and the SGD Step Length},
  author={Stanislaw Jastrzebski and Zachary Kenton and Nicolas Ballas and Asja Fischer and Yoshua Bengio and A. Storkey},
  journal={arXiv: Machine Learning},
  year={2019}
}
  • Stanislaw Jastrzebski, Zachary Kenton, Nicolas Ballas, Asja Fischer, Yoshua Bengio, A. Storkey
  • Published 2019
  • Mathematics, Computer Science
  • arXiv: Machine Learning
  • Abstract: Stochastic Gradient Descent (SGD) based training of neural networks with a large learning rate or a small batch size typically ends in well-generalizing, flat regions of the weight space, as indicated by small eigenvalues of the Hessian of the training loss. [...] Key Result: In summary, our analysis of the dynamics of SGD in the subspace of the sharpest directions shows that they influence the regions that SGD steers to (where a larger learning rate or a smaller batch size results in wider regions visited), the [...]
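To make the quantities in the abstract concrete, the sketch below runs SGD on a toy noisy quadratic loss and tracks how far the iterates wander along the sharpest Hessian direction as the step length relative to the top curvature, η·λ_max, is varied. This is an illustrative reconstruction of the general idea, not the paper's code or experimental setup: the quadratic loss, the noise model, the batch size, the constants, and the helper names (stochastic_grad, run_sgd) are all assumptions chosen for clarity.

```python
# A minimal sketch, NOT the paper's code: SGD on a noisy quadratic loss
# L(w) = 0.5 * w^T H w with a fixed Hessian H, so the sharpest direction is
# simply the top eigenvector of H and eta * lambda_max can be dialed directly.
# On a real network, lambda_max and its eigenvector would instead have to be
# estimated from Hessian-vector products (e.g., power iteration or Lanczos).
import numpy as np

rng = np.random.default_rng(0)

# Anisotropic curvature: lambda_max = 50 (sharp direction), lambda_min = 1 (flat).
H = np.diag([50.0, 1.0])
lam_max = 50.0
v_max = np.array([1.0, 0.0])  # top Hessian eigenvector ("sharpest direction")

def stochastic_grad(w, batch_size=8, noise_scale=1.0):
    """Full-batch gradient H @ w plus zero-mean noise averaged over a mini-batch.

    Smaller batches give noisier gradients, loosely mimicking mini-batch SGD.
    """
    noise = noise_scale * rng.standard_normal((batch_size, 2)).mean(axis=0)
    return H @ w + noise

def run_sgd(eta, steps=500):
    """Run SGD and record |w . v_max|, the distance along the sharpest direction."""
    w = np.array([1.0, 1.0])
    along_sharpest = []
    for _ in range(steps):
        w = w - eta * stochastic_grad(w)
        along_sharpest.append(abs(v_max @ w))
    return np.array(along_sharpest)

# eta * lambda_max is the step length measured against the sharpest curvature;
# as it approaches 2, iterates bounce over a visibly wider region along v_max.
for eta in (0.01, 0.03, 0.039):
    trace = run_sgd(eta)
    print(f"eta*lambda_max = {eta * lam_max:.2f}  "
          f"typical |distance| along sharpest direction = {trace[-100:].mean():.4f}")
```

Under these toy assumptions, the fluctuation along the sharpest direction widens as η·λ_max approaches 2, which is the flavor of relation between SGD step length and sharpest-direction dynamics that the paper examines empirically on real networks.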
    33 Citations
    • The Break-Even Point on Optimization Trajectories of Deep Neural Networks, 2019 (16 citations, Highly Influenced)
    • Curvature is Key: Sub-Sampled Loss Surfaces and the Implications for Large Batch Training (1 citation)
    • Gradient Noise Convolution (GNC): Smoothing Loss Function for Distributed Large-Batch SGD (2 citations)
    • Experimental exploration on loss surface of deep neural network
    • Emergent properties of the local geometry of neural loss landscapes (10 citations)
    • On Learning Rates and Schrödinger Operators (7 citations)
    • Laplacian Smoothing Gradient Descent (19 citations)
