Corpus ID: 5834589

# On Large-Batch Training for Deep Learning: Generalization Gap and Sharp Minima

@article{Keskar2017OnLT,
title={On Large-Batch Training for Deep Learning: Generalization Gap and Sharp Minima},
author={Nitish Shirish Keskar and Dheevatsa Mudigere and Jorge Nocedal and Mikhail Smelyanskiy and Ping Tak Peter Tang},
journal={ArXiv},
year={2017},
volume={abs/1609.04836}
}
The stochastic gradient descent (SGD) method and its variants are algorithms of choice for many Deep Learning tasks. These methods operate in a small-batch regime wherein a fraction of the training data, say $32$-$512$ data points, is sampled to compute an approximation to the gradient. It has been observed in practice that when using a larger batch there is a degradation in the quality of the model, as measured by its ability to generalize. We investigate the cause for this generalization drop…
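The small-batch regime the abstract describes can be sketched concretely. Below is a minimal illustration of mini-batch SGD on a least-squares objective; the function name and the quadratic loss are illustrative choices, not from the paper:

```python
import numpy as np

def minibatch_sgd(X, y, batch_size=32, lr=0.1, epochs=50, seed=0):
    """Fit a least-squares linear model with mini-batch SGD.

    Each step samples `batch_size` points and uses their average
    gradient as a cheap approximation to the full-batch gradient.
    """
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(epochs):
        idx = rng.permutation(n)  # reshuffle each epoch
        for start in range(0, n, batch_size):
            batch = idx[start:start + batch_size]
            # gradient of 0.5 * mean((X w - y)^2) over the mini-batch
            residual = X[batch] @ w - y[batch]
            w -= lr * (X[batch].T @ residual) / len(batch)
    return w
```

Raising `batch_size` from 32 toward the full dataset size trades gradient noise for per-step cost, which is precisely the knob whose effect on generalization the paper studies.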
#### Citations (1,489)

Extrapolation for Large-batch Training in Deep Learning
• Computer Science, Mathematics
• ICML
• 2020
This work proposes to use computationally efficient extrapolation (extragradient) to stabilize the optimization trajectory while still benefiting from smoothing to avoid sharp minima, and proves the convergence of this novel scheme and rigorously evaluates its empirical performance on ResNet, LSTM, and Transformer.
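The extragradient step mentioned here evaluates the gradient at a lookahead point and applies it at the current iterate. A minimal sketch of that step alone (the paper's actual scheme combines it with smoothing, which is omitted here):

```python
def extragradient_step(w, grad_fn, lr=0.1):
    """One extragradient update: extrapolate with the local gradient,
    then step from the original point using the lookahead gradient."""
    w_lookahead = w - lr * grad_fn(w)     # extrapolation (lookahead) step
    return w - lr * grad_fn(w_lookahead)  # corrected update from the original point
```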
Towards Theoretical Understanding of Large Batch Training in Stochastic Gradient Descent
• Mathematics, Computer Science
• ArXiv
• 2018
It is proved that SGD tends to converge to flatter minima in the asymptotic regime (although it may take exponential time to converge) regardless of the batch size, and that SGD with a larger ratio of learning rate to batch size tends to converge to a flat minimum faster; however, its generalization performance could be worse.
Stochastic Gradient Descent with Large Learning Rate
• Mathematics, Computer Science
• ArXiv
• 2020
The main contribution of this work is to derive the stable distribution of discrete-time SGD for a quadratic loss function, with and without momentum.
An Empirical Study of Large-Batch Stochastic Gradient Descent with Structured Covariance Noise
• Computer Science, Mathematics
• 2019
Empirical studies with standard deep-learning architectures and datasets show that the proposed method of adding covariance noise to the gradients not only improves generalization performance in large-batch training, but does so in a way that keeps optimization performance desirable without elongating training.
The Impact of Local Geometry and Batch Size on the Convergence and Divergence of Stochastic Gradient Descent
Stochastic small-batch (SB) methods, such as mini-batch Stochastic Gradient Descent (SGD), have been extremely successful in training neural networks with strong generalization properties…
SmoothOut: Smoothing Out Sharp Minima for Generalization in Large-Batch Deep Learning
• Computer Science
• ArXiv
• 2018
It is proved that Stochastic SmoothOut is an unbiased approximation of the original SmoothOut and can eliminate sharp minima in Deep Neural Networks (DNNs), thereby closing the generalization gap.
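The perturbation idea summarized above can be sketched as a single SGD step taken on randomly perturbed weights. This is a rough reading of "Stochastic SmoothOut", not the paper's code; the uniform-noise radius `a` is an arbitrary illustrative value:

```python
import numpy as np

def stochastic_smoothout_step(w, grad_fn, lr=0.1, a=0.01, rng=None):
    """One step of SGD on a randomly perturbed copy of the weights.

    Averaging the loss over uniform perturbations in [-a, a] smooths the
    landscape and suppresses sharp minima; sampling a single perturbation
    per step gives an unbiased estimate of that smoothed gradient.
    """
    rng = rng or np.random.default_rng()
    noise = rng.uniform(-a, a, size=np.shape(w))
    # gradient at the perturbed point, applied to the unperturbed weights
    return w - lr * grad_fn(w + noise)
```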
Stochastic Gradient Descent with Moderate Learning Rate
• Jingfeng Wu
• 2021
Understanding the algorithmic bias of stochastic gradient descent (SGD) is one of the key challenges in modern machine learning and deep learning theory. Most of the existing works, however, focus on…
Stochastic Normalized Gradient Descent with Momentum for Large Batch Training
• Computer Science, Mathematics
• ArXiv
• 2020
This paper theoretically proves that, compared to momentum SGD (MSGD), SNGM can adopt a larger batch size to converge to an $\epsilon$-stationary point with the same computation complexity (total number of gradient computations).
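A sketch of a normalized-momentum update consistent with this description follows; this is my reading of the abstract, not the paper's exact algorithm:

```python
import numpy as np

def sngm_step(w, u, grad, lr=0.1, beta=0.9, eps=1e-12):
    """Momentum accumulation followed by a unit-norm step.

    Because the step direction is normalized, the step length is set by
    `lr` alone, independent of the raw gradient magnitude.
    """
    u = beta * u + grad                          # momentum buffer
    w = w - lr * u / (np.linalg.norm(u) + eps)   # normalized step
    return w, u
```

Normalizing the step is what makes the method insensitive to the reduced gradient noise of large batches, which is the intuition behind its larger admissible batch size.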
Gaussian Process Inference Using Mini-batch Stochastic Gradient Descent: Convergence Guarantees and Empirical Benefits
• Hao Chen, Lili Zheng
• Computer Science, Mathematics
• ArXiv
• 2021
Numerical studies on both simulated and real datasets demonstrate that mini-batch SGD has better generalization than state-of-the-art GP methods while reducing the computational burden and opening a new, previously unexplored, data-size regime for GPs.
A closer look at batch size in mini-batch training of deep auto-encoders
• Heng Wang
• Computer Science
• 2017 3rd IEEE International Conference on Computer and Communications (ICCC)
• 2017
This paper tested the generalizability of deep auto-encoders trained with varying batch sizes and checked several well-known measures of model generalization, finding no obvious generalization gap in regression models such as auto-encoders.

#### References

Showing 1-10 of 61 references
Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift
• Computer Science
• ICML
• 2015
Applied to a state-of-the-art image classification model, Batch Normalization achieves the same accuracy with 14 times fewer training steps, and beats the original model by a significant margin.
Optimization Methods for Large-Scale Machine Learning
• Computer Science, Mathematics
• SIAM Rev.
• 2018
A major theme of this study is that large-scale machine learning represents a distinctive setting in which the stochastic gradient method has traditionally played a central role while conventional gradient-based nonlinear optimization techniques typically falter, leading to a discussion about the next generation of optimization methods for large-scale machine learning.
Adam: A Method for Stochastic Optimization
• Computer Science, Mathematics
• ICLR
• 2015
This work introduces Adam, an algorithm for first-order gradient-based optimization of stochastic objective functions, based on adaptive estimates of lower-order moments, and provides a regret bound on the convergence rate that is comparable to the best known results under the online convex optimization framework.
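For reference, the Adam update with bias-corrected first- and second-moment estimates, as defined in that paper, in a compact self-contained form:

```python
import numpy as np

def adam_step(w, m, v, grad, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update (the step counter t starts at 1)."""
    m = beta1 * m + (1 - beta1) * grad       # first-moment (mean) estimate
    v = beta2 * v + (1 - beta2) * grad**2    # second-moment (uncentered variance) estimate
    m_hat = m / (1 - beta1**t)               # bias corrections for zero initialization
    v_hat = v / (1 - beta2**t)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v
```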
Entropy-SGD: Biasing Gradient Descent Into Wide Valleys
This paper proposes a new optimization algorithm called Entropy-SGD for training deep neural networks that is motivated by the local geometry of the energy landscape and compares favorably to state-of-the-art techniques in terms of generalization error and training time.
On the importance of initialization and momentum in deep learning
• Computer Science
• ICML
• 2013
It is shown that when stochastic gradient descent with momentum uses a well-designed random initialization and a particular type of slowly increasing schedule for the momentum parameter, it can train both DNNs and RNNs to levels of performance that were previously achievable only with Hessian-Free optimization.
adaQN: An Adaptive Quasi-Newton Algorithm for Training RNNs
• Computer Science, Mathematics
• ECML/PKDD
• 2016
adaQN is presented, a stochastic quasi-Newton algorithm for training RNNs that retains a low per-iteration cost while allowing for non-diagonal scaling through a stochastic L-BFGS updating scheme and is judicious in storing and retaining L-BFGS curvature pairs.
Train faster, generalize better: Stability of stochastic gradient descent
• Computer Science, Mathematics
• ICML
• 2016
We show that parametric models trained by a stochastic gradient method (SGM) with few iterations have vanishing generalization error. We prove our results by arguing that SGM is algorithmically…
No bad local minima: Data independent training error guarantees for multilayer neural networks
• Mathematics, Computer Science
• ArXiv
• 2016
It is proved that for an MNN with one hidden layer, the training error is zero at every differentiable local minimum, for almost every dataset and dropout-like noise realization, and the result is extended to the case of more than one hidden layer.
Sample size selection in optimization methods for machine learning
• Computer Science, Mathematics
• Math. Program.
• 2012
A criterion is presented for increasing the sample size based on variance estimates obtained during the computation of a batch gradient, and an $O(1/\epsilon)$ complexity bound is established on the total cost of a gradient method.