Corpus ID: 5834589

On Large-Batch Training for Deep Learning: Generalization Gap and Sharp Minima

@article{Keskar2017OnLT,
  title={On Large-Batch Training for Deep Learning: Generalization Gap and Sharp Minima},
  author={Nitish Shirish Keskar and Dheevatsa Mudigere and Jorge Nocedal and Mikhail Smelyanskiy and Ping Tak Peter Tang},
  journal={ArXiv},
  year={2017},
  volume={abs/1609.04836}
}
The stochastic gradient descent (SGD) method and its variants are algorithms of choice for many Deep Learning tasks. These methods operate in a small-batch regime wherein a fraction of the training data, say $32$-$512$ data points, is sampled to compute an approximation to the gradient. It has been observed in practice that when using a larger batch there is a degradation in the quality of the model, as measured by its ability to generalize. We investigate the cause for this generalization drop…
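As an illustration of the small-batch regime described in the abstract, the following is a minimal NumPy sketch of mini-batch gradient estimation on a toy least-squares problem (the data, loss, and hyperparameters are made up for illustration):

import numpy as np

rng = np.random.default_rng(0)

def minibatch_gradient(w, X, y, batch_size, rng):
    """Unbiased estimate of the full-batch gradient of 0.5*||Xw - y||^2 / n."""
    idx = rng.choice(len(X), size=batch_size, replace=False)
    Xb, yb = X[idx], y[idx]
    return Xb.T @ (Xb @ w - yb) / batch_size

# toy regression data: 10,000 points, 20 features
X = rng.normal(size=(10_000, 20))
y = X @ rng.normal(size=20) + 0.1 * rng.normal(size=10_000)

w = np.zeros(20)
for _ in range(1_000):
    w -= 0.01 * minibatch_gradient(w, X, y, batch_size=128, rng=rng)

Growing batch_size toward the full dataset reduces the noise in the gradient estimate; the paper attributes the accompanying generalization drop to convergence toward sharper minimizers.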
Extrapolation for Large-batch Training in Deep Learning
TLDR
This work proposes to use computationally efficient extrapolation (extragradient) to stabilize the optimization trajectory while still benefiting from smoothing to avoid sharp minima, and proves the convergence of this novel scheme and rigorously evaluates its empirical performance on ResNet, LSTM, and Transformer.
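For readers unfamiliar with the extragradient idea, a minimal generic sketch (not the authors' exact scheme) is: take a lookahead gradient step, then update the original iterate with the gradient evaluated at the lookahead point.

import numpy as np

def extragradient_step(w, grad_fn, lr):
    """Generic extragradient update: the gradient is evaluated at a lookahead point."""
    w_lookahead = w - lr * grad_fn(w)       # extrapolation (lookahead) step
    return w - lr * grad_fn(w_lookahead)    # correction applied to the original iterate

# usage on the toy quadratic f(w) = 0.5 * ||w||^2, whose gradient is w
w = np.ones(5)
for _ in range(100):
    w = extragradient_step(w, lambda v: v, lr=0.1)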
Towards Theoretical Understanding of Large Batch Training in Stochastic Gradient Descent
TLDR
It is proved that SGD tends to converge to flatter minima in the asymptotic regime (although it may take exponential time to converge) regardless of the batch size, and that SGD with a larger ratio of learning rate to batch size tends to converge to a flat minimum faster; however, its generalization performance could be worse.
Stochastic Gradient Descent with Large Learning Rate
TLDR
The main contributions of this work are to derive the stable distribution for discrete-time SGD in a quadratic loss function with and without momentum.
An Empirical Study of Large-Batch Stochastic Gradient Descent with Structured Covariance Noise
TLDR
Empirical studies with standard deep-learning architectures and datasets show that the proposed method of adding covariance noise to the gradients not only improves generalization performance in large-batch training, but does so while keeping optimization performance desirable and without lengthening training.
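To make the mechanism concrete, here is a hedged sketch of injecting Gaussian noise into a large-batch gradient before the update. The cited work argues for a structured covariance; the isotropic covariance below is only a stand-in for illustration.

import numpy as np

def noisy_gradient_step(w, grad, lr, noise_scale, rng):
    """Large-batch SGD update with additive Gaussian gradient noise.

    Isotropic covariance (noise_scale**2 * I) is used purely for illustration;
    the cited paper proposes a structured covariance instead.
    """
    noise = noise_scale * rng.normal(size=w.shape)
    return w - lr * (grad + noise)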
The Impact of Local Geometry and Batch Size on the Convergence and Divergence of Stochastic Gradient Descent
Stochastic small-batch (SB) methods, such as mini-batch Stochastic Gradient Descent (SGD), have been extremely successful in training neural networks with strong generalization properties. In the…
SmoothOut: Smoothing Out Sharp Minima for Generalization in Large-Batch Deep Learning
TLDR
It is proved that the stochastic SmoothOut is an unbiased approximation of the original SmoothOut and can eliminate sharp minima in Deep Neural Networks (DNNs), thereby closing the generalization gap.
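A rough sketch of the weight-perturbation idea suggested by the summary (the perturbation radius a is an assumed hyperparameter name, and this is not the authors' exact algorithm): perturb the weights, take the gradient at the perturbed point, and apply it to the unperturbed weights, which amounts to a stochastic step on a smoothed version of the loss.

import numpy as np

def smoothed_sgd_step(w, grad_fn, lr, a, rng):
    """One stochastic step on a uniformly smoothed loss: a single random
    perturbation gives an unbiased gradient estimate of the smoothed objective."""
    noise = rng.uniform(-a, a, size=w.shape)   # perturb weights within a small box
    g = grad_fn(w + noise)                     # gradient at the perturbed point
    return w - lr * g                          # update the unperturbed weights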
Stochastic Gradient Descent with Moderate Learning Rate
Understanding the algorithmic bias of stochastic gradient descent (SGD) is one of the key challenges in modern machine learning and deep learning theory. Most of the existing works, however, focus on…
Stochastic Normalized Gradient Descent with Momentum for Large Batch Training
TLDR
This paper theoretically proves that, compared to momentum SGD (MSGD), SNGM can adopt a larger batch size to converge to an $\epsilon$-stationary point with the same computation complexity (total number of gradient computations).
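Since the exact SNGM recursion is in the paper, the sketch below only shows a generic normalized-gradient update with a momentum buffer, the two ingredients the summary highlights:

import numpy as np

def normalized_momentum_step(w, g, buf, lr, beta=0.9, eps=1e-12):
    """Generic normalized-gradient step with momentum (an illustration,
    not the exact SNGM update from the cited paper)."""
    g_hat = g / (np.linalg.norm(g) + eps)   # normalize the stochastic gradient
    buf = beta * buf + g_hat                # accumulate momentum
    return w - lr * buf, buf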
Gaussian Process Inference Using Mini-batch Stochastic Gradient Descent: Convergence Guarantees and Empirical Benefits
TLDR
Numerical studies on both simulated and real datasets demonstrate that mini-batch SGD has better generalization than state-of-the-art GP methods while reducing the computational burden and opening a new, previously unexplored, data-size regime for GPs.
A closer look at batch size in mini-batch training of deep auto-encoders
  • Heng Wang, Kaijun Ren, Jun-qiang Song
  • Computer Science
  • 2017 3rd IEEE International Conference on Computer and Communications (ICCC)
  • 2017
TLDR
This paper tests the generalizability of deep auto-encoders trained with varying batch sizes and checks several well-known measures relating to model generalization, finding no obvious generalization gap in regression models such as auto-encoders.

References

SHOWING 1-10 OF 61 REFERENCES
Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift
TLDR
Applied to a state-of-the-art image classification model, Batch Normalization achieves the same accuracy with 14 times fewer training steps, and beats the original model by a significant margin.
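For reference, a minimal NumPy sketch of the batch-normalization transform at training time (per-feature statistics over the mini-batch, followed by the learned scale and shift); the running-average bookkeeping used at inference is omitted.

import numpy as np

def batch_norm_forward(x, gamma, beta, eps=1e-5):
    """Normalize each feature over the mini-batch, then scale and shift.

    x: (batch, features); gamma, beta: learned (features,) parameters.
    """
    mean = x.mean(axis=0)
    var = x.var(axis=0)
    x_hat = (x - mean) / np.sqrt(var + eps)
    return gamma * x_hat + beta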
Optimization Methods for Large-Scale Machine Learning
TLDR
A major theme of this study is that large-scale machine learning represents a distinctive setting in which the stochastic gradient method has traditionally played a central role while conventional gradient-based nonlinear optimization techniques typically falter, leading to a discussion about the next generation of optimization methods for large-scale machine learning.
Adam: A Method for Stochastic Optimization
TLDR
This work introduces Adam, an algorithm for first-order gradient-based optimization of stochastic objective functions, based on adaptive estimates of lower-order moments, and provides a regret bound on the convergence rate that is comparable to the best known results under the online convex optimization framework.
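The update rule is short enough to show in full; this is the standard Adam step with the paper's default hyperparameters (the surrounding training loop and the step counter t, which starts at 1, are assumed):

import numpy as np

def adam_step(w, g, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update with bias-corrected first and second moment estimates (t >= 1)."""
    m = beta1 * m + (1 - beta1) * g          # first moment (mean of gradients)
    v = beta2 * v + (1 - beta2) * g * g      # second moment (uncentered variance)
    m_hat = m / (1 - beta1 ** t)             # bias corrections
    v_hat = v / (1 - beta2 ** t)
    return w - lr * m_hat / (np.sqrt(v_hat) + eps), m, v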
Entropy-SGD: Biasing Gradient Descent Into Wide Valleys
TLDR
This paper proposes a new optimization algorithm called Entropy-SGD for training deep neural networks that is motivated by the local geometry of the energy landscape and compares favorably to state-of-the-art techniques in terms of generalization error and training time.
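A simplified sketch of the local-entropy idea as described (constants and the averaging rule are illustrative, not the paper's exact schedule): an inner SGLD loop samples around the current weights, and the outer step pulls the weights toward the running mean of those samples, biasing the trajectory toward wide valleys.

import numpy as np

def entropy_sgd_step(w, grad_fn, rng, gamma=0.03, inner_lr=0.1,
                     inner_steps=20, noise=1e-3, outer_lr=1.0):
    """Simplified Entropy-SGD outer step (illustrative constants)."""
    x, mu = w.copy(), w.copy()
    for _ in range(inner_steps):
        # SGLD on the loss plus a proximal term that keeps x near w
        g = grad_fn(x) + gamma * (x - w)
        x = x - inner_lr * g + noise * np.sqrt(inner_lr) * rng.normal(size=w.shape)
        mu = 0.75 * mu + 0.25 * x            # running average of the inner iterates
    # the negative local-entropy gradient is approximately gamma * (w - mu)
    return w - outer_lr * gamma * (w - mu)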
On the importance of initialization and momentum in deep learning
TLDR
It is shown that when stochastic gradient descent with momentum uses a well-designed random initialization and a particular type of slowly increasing schedule for the momentum parameter, it can train both DNNs and RNNs to levels of performance that were previously achievable only with Hessian-Free optimization.
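As context for the summary, classical momentum with a slowly increasing coefficient looks like the sketch below; the particular ramp is only illustrative, not the schedule from the paper.

import numpy as np

def momentum_step(w, g, velocity, lr, mu):
    """Classical momentum: v <- mu*v - lr*g; w <- w + v."""
    velocity = mu * velocity - lr * g
    return w + velocity, velocity

def momentum_schedule(t, mu_max=0.99):
    """Illustrative slowly increasing momentum coefficient, capped at mu_max."""
    return min(mu_max, 1.0 - 1.0 / (t // 250 + 2))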
adaQN: An Adaptive Quasi-Newton Algorithm for Training RNNs
TLDR
adaQN is presented, a stochastic quasi-Newton algorithm for training RNNs that retains a low per-iteration cost while allowing for non-diagonal scaling through a stochastic L-BFGS updating scheme and is judicious in storing and retaining L-BFGS curvature pairs.
Train faster, generalize better: Stability of stochastic gradient descent
We show that parametric models trained by a stochastic gradient method (SGM) with few iterations have vanishing generalization error. We prove our results by arguing that SGM is algorithmically…
No bad local minima: Data independent training error guarantees for multilayer neural networks
TLDR
It is proved that for an MNN with one hidden layer, the training error is zero at every differentiable local minimum, for almost every dataset and dropout-like noise realization, and this result is extended to the case of more than one hidden layer.
Sample size selection in optimization methods for machine learning
TLDR
A criterion for increasing the sample size based on variance estimates obtained during the computation of a batch gradient is presented, and an $O(1/\epsilon)$ complexity bound on the total cost of a gradient method is established.
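A hedged sketch of that kind of variance test (the exact test and growth rule in the paper may differ): if the sampled gradients are too noisy relative to the norm of their mean, the sample size is increased.

import numpy as np

def maybe_grow_batch(per_sample_grads, batch_size, theta=0.5, growth=1.1):
    """Grow the sample size when the mini-batch gradient looks too noisy.

    per_sample_grads: (batch_size, dim) array of individual gradients.
    This mirrors the flavor of the cited variance test, not its exact form.
    """
    g_bar = per_sample_grads.mean(axis=0)
    var = per_sample_grads.var(axis=0).sum()       # total variance across coordinates
    if var / batch_size > theta ** 2 * np.dot(g_bar, g_bar):
        return int(np.ceil(growth * batch_size))   # estimate too noisy: enlarge the sample
    return batch_size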
Adaptive Subgradient Methods for Online Learning and Stochastic Optimization
TLDR
This work describes and analyzes an apparatus for adaptively modifying the proximal function, which significantly simplifies setting a learning rate and results in regret guarantees that are provably as good as those of the best proximal function chosen in hindsight.
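The diagonal version of the adaptive rule is compact enough to show; this is the standard AdaGrad update (the paper's proximal-function framework is more general than this special case).

import numpy as np

def adagrad_step(w, g, accum, lr=0.01, eps=1e-8):
    """AdaGrad: per-coordinate step sizes shrink with accumulated squared gradients."""
    accum = accum + g * g
    return w - lr * g / (np.sqrt(accum) + eps), accum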