Corpus ID: 52071640

Don't Use Large Mini-Batches, Use Local SGD

@article{Lin2020DontUL,
  title={Don't Use Large Mini-Batches, Use Local SGD},
  author={Tao Lin and S. Stich and M. Jaggi},
  journal={ArXiv},
  year={2020},
  volume={abs/1808.07217}
}
  • Tao Lin, S. Stich, M. Jaggi
  • Published 2020
  • Computer Science, Mathematics
  • ArXiv
  • Mini-batch stochastic gradient methods (SGD) are state of the art for distributed training of deep neural networks. Drastic increases in the mini-batch sizes have led to key efficiency and scalability gains in recent years. However, progress faces a major roadblock, as models trained with large batches often do not generalize well, i.e., they do not show good accuracy on new data. As a remedy, we propose a post-local SGD and show that it significantly improves the generalization…
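
The abstract only sketches the post-local SGD idea, so below is a minimal NumPy illustration rather than the paper's implementation: a few simulated workers first run standard synchronous mini-batch SGD, then switch to local SGD, where each worker takes several local steps before the models are averaged. The toy least-squares problem and all hyperparameters (number of workers K, local steps H, switch point, learning rate) are illustrative assumptions, not values from the paper.

import numpy as np

# Illustrative sketch of post-local SGD; problem and hyperparameters are assumptions.
rng = np.random.default_rng(0)

# Toy regression problem y = X w* + noise, split across K workers.
d, n_per_worker, K = 10, 256, 4
w_star = rng.normal(size=d)
data = []
for _ in range(K):
    X = rng.normal(size=(n_per_worker, d))
    y = X @ w_star + 0.1 * rng.normal(size=n_per_worker)
    data.append((X, y))

def grad(w, X, y, batch):
    # Mini-batch gradient of the least-squares loss on the given index set.
    Xb, yb = X[batch], y[batch]
    return Xb.T @ (Xb @ w - yb) / len(batch)

lr, batch_size = 0.05, 32
T_sync, T_local, H = 200, 200, 8  # sync-phase steps, local-phase steps, local steps per round

# Phase 1: standard synchronous mini-batch SGD (gradients averaged at every step).
w = np.zeros(d)
for t in range(T_sync):
    g = np.mean([grad(w, X, y, rng.choice(n_per_worker, batch_size)) for X, y in data], axis=0)
    w -= lr * g

# Phase 2 (post-local SGD): each worker runs H independent local SGD steps,
# then the worker models are averaged; communication happens only once per H steps.
local_w = [w.copy() for _ in range(K)]
for r in range(T_local // H):
    for k, (X, y) in enumerate(data):
        for _ in range(H):
            local_w[k] -= lr * grad(local_w[k], X, y, rng.choice(n_per_worker, batch_size))
    w = np.mean(local_w, axis=0)
    local_w = [w.copy() for _ in range(K)]

print("distance to w*:", np.linalg.norm(w - w_star))
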
    127 Citations
    • Scalable and Practical Natural Gradient for Large-Scale Deep Learning
    • Extrapolation for Large-batch Training in Deep Learning
    • The Limit of the Batch Size
    • Large-Scale Distributed Second-Order Optimization Using Kronecker-Factored Approximate Curvature for Deep Convolutional Neural Networks
    • Local SGD Converges Fast and Communicates Little (S. Stich, ICLR 2019)
    • DaSGD: Squeezing SGD Parallelization Performance in Distributed Training Using Delayed Averaging
    • Parallel Restarted SGD with Faster Convergence and Less Communication: Demystifying Why Model Averaging Works for Deep Learning
    • Communication trade-offs for Local-SGD with large step size
