On Large-Batch Training for Deep Learning: Generalization Gap and Sharp Minima

Nitish Shirish Keskar, Dheevatsa Mudigere, Jorge Nocedal, Mikhail Smelyanskiy, Ping Tak Peter Tang
The stochastic gradient descent (SGD) method and its variants are algorithms of choice for many Deep Learning tasks. These methods operate in a small-batch regime wherein a fraction of the training data, say 32–512 data points, is sampled to compute an approximation to the gradient. It has been observed in practice that when using a larger batch there is a degradation in the quality of the model, as measured by its ability to generalize. We investigate the cause for this generalization drop in…
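
To make the small-batch regime concrete, the following is a minimal sketch of how a gradient can be approximated from a sampled batch. It assumes a plain least-squares model on synthetic data, which is not taken from the paper; the function names `minibatch_gradient` and `full_gradient` and the batch sizes tried are illustrative choices only.

```python
import numpy as np


def full_gradient(w, X, y):
    """Gradient of the mean squared-error loss 0.5 * mean((X @ w - y)**2) over all data."""
    return X.T @ (X @ w - y) / X.shape[0]


def minibatch_gradient(w, X, y, batch_size, rng):
    """Approximate the same gradient from a batch sampled without replacement."""
    idx = rng.choice(X.shape[0], size=batch_size, replace=False)
    Xb, yb = X[idx], y[idx]
    return Xb.T @ (Xb @ w - yb) / batch_size


# Synthetic regression problem (an assumption made purely for illustration).
rng = np.random.default_rng(0)
n_samples, n_features = 4096, 5
X = rng.normal(size=(n_samples, n_features))
w_true = rng.normal(size=n_features)
y = X @ w_true + 0.1 * rng.normal(size=n_samples)

w = np.zeros(n_features)
g_full = full_gradient(w, X, y)

# Compare the sampled estimate with the full gradient for several batch sizes.
for batch_size in (32, 512, n_samples):
    g_batch = minibatch_gradient(w, X, y, batch_size, rng)
    print(batch_size, np.linalg.norm(g_batch - g_full))
```

When the batch covers the whole dataset, the estimate coincides with the full gradient up to floating-point error; smaller batches give a noisier estimate of it.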
This paper has highly influenced 39 other papers and has 432 citations.




