• Corpus ID: 220831128

# Stochastic Normalized Gradient Descent with Momentum for Large Batch Training

```bibtex
@article{Zhao2020StochasticNG,
  title={Stochastic Normalized Gradient Descent with Momentum for Large Batch Training},
  author={Shen-Yi Zhao and Yin-Peng Xie and Wu-Jun Li},
  journal={ArXiv},
  year={2020},
  volume={abs/2007.13985}
}
```
• Published 28 July 2020 • Computer Science • ArXiv
Stochastic gradient descent (SGD) and its variants have been the dominant optimization methods in machine learning. Compared with small batch training, SGD with large batch training can better utilize the computational power of current multi-core systems like GPUs and can reduce the number of communication rounds in distributed training. Hence, SGD with large batch training has attracted increasing attention. However, existing empirical results show that large batch training typically…
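The paper's method, SNGM, combines gradient normalization with momentum for the large-batch setting. As a rough illustration only (the exact update rule is given in the paper; the form below, the hyperparameter names, and the `eps` term are assumptions), one common shape of a normalized-momentum SGD step is:

```python
import numpy as np

def sngm_step(w, grad, momentum, lr=0.1, beta=0.9, eps=1e-8):
    """One sketch-step of normalized momentum SGD (an assumption of the
    SNGM idea, not the paper's verbatim algorithm): accumulate momentum,
    then take a step of fixed length lr along the normalized momentum."""
    momentum = beta * momentum + grad                          # momentum buffer
    direction = momentum / (np.linalg.norm(momentum) + eps)    # unit direction
    w = w - lr * direction                                     # bounded-length step
    return w, momentum
```

Because the direction is normalized, every step has length (at most) `lr`, regardless of how large the stochastic gradient is — the property that makes such updates attractive when large batches produce low-variance but differently scaled gradients.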
## 3 Citations

• ECCV 2022 (Computer Science). It is proved that the convergence rate of ENGM is independent of the variance of the gradients, making it suitable for adversarial training (AT), and that it alleviates major shortcomings of AT, including robust overfitting and high sensitivity to hyperparameter settings.
• ICML 2021 (Computer Science). An extensive, standardized benchmark of more than a dozen particularly popular deep learning optimizers is performed, identifying a significantly reduced subset of specific algorithms and parameter choices that generally provided competitive results in the authors' experiments.

## References

Showing 1–10 of 20 references.

• ICLR 2017 (Computer Science). This work investigates the cause of the generalization drop in the large-batch regime and presents numerical evidence supporting the view that large-batch methods tend to converge to sharp minimizers of the training and testing functions — and, as is well known, sharp minima lead to poorer generalization.
• ArXiv 2017 (Computer Science). Layer-wise Adaptive Rate Scaling (LARS) is proposed, a method that enables large-batch training for general networks and datasets; it can scale the batch size to 32768 for ResNet-50 and to 8192 for AlexNet.
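The layer-wise scaling idea behind LARS can be sketched for a single layer as follows (a minimal sketch following the commonly described form; the `trust` coefficient, `eps` term, and exact placement of weight decay are assumptions, not the paper's verbatim algorithm):

```python
import numpy as np

def lars_update(w, grad, lr=0.1, trust=0.001, weight_decay=1e-4, eps=1e-8):
    """One LARS-style step for one layer's weights: scale the global learning
    rate by the ratio of the weight norm to the (regularized) gradient norm,
    so each layer takes a step proportional to its own parameter scale."""
    g = grad + weight_decay * w                               # add weight decay
    local_lr = trust * np.linalg.norm(w) / (np.linalg.norm(g) + eps)
    return w - lr * local_lr * g                              # layer-wise scaled step
```

The key design point is that `local_lr` is computed per layer, so layers with small gradients relative to their weights are not starved of updates when the global learning rate is tuned for large batches.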
• NIPS 2017 (Computer Science). This work proposes a "random walk on a random landscape" statistical model, which is known to exhibit similar "ultra-slow" diffusion behavior, and presents a novel algorithm named "Ghost Batch Normalization" that enables a significant decrease in the generalization gap without increasing the number of updates.
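The core of Ghost Batch Normalization is to compute normalization statistics over small "ghost" sub-batches inside one large batch. A minimal sketch (assumptions: the cited version also keeps learned scale/shift parameters and running statistics, which are omitted here):

```python
import numpy as np

def ghost_batch_norm(x, ghost_size=32, eps=1e-5):
    """Normalize each ghost (virtual) sub-batch of rows with its own mean and
    variance, rather than using one set of statistics for the full batch."""
    n = x.shape[0]
    out = np.empty_like(x)
    for start in range(0, n, ghost_size):
        chunk = x[start:start + ghost_size]
        mu = chunk.mean(axis=0)                    # per-ghost-batch mean
        var = chunk.var(axis=0)                    # per-ghost-batch variance
        out[start:start + ghost_size] = (chunk - mu) / np.sqrt(var + eps)
    return out
```

Using small-batch statistics inside a large batch recovers some of the regularizing noise that plain batch normalization loses as the batch grows.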
• ICLR 2020 (Computer Science). This work proposes *post-local* SGD and shows that it significantly improves generalization performance compared to large-batch training on standard benchmarks while enjoying the same efficiency and scalability.
• ArXiv 2017 (Computer Science). This paper empirically shows that on the ImageNet dataset large minibatches cause optimization difficulties, but that when these are addressed the trained networks exhibit good generalization, enabling visual recognition models to be trained on internet-scale data with high efficiency.
• AAAI 2019 (Computer Science). A thorough and rigorous theoretical study of why model averaging can work as well as parallel mini-batch SGD with significantly less communication overhead.
• ArXiv 2019 (Computer Science). NovoGrad, an adaptive stochastic gradient descent method with layer-wise gradient normalization and decoupled weight decay, performs on par with or better than well-tuned SGD with momentum, Adam, and AdamW in experiments on neural networks.
• NIPS 2015 (Computer Science). This paper analyzes a stochastic version of NGD and proves its convergence to a global minimum for a wider class of functions: it requires only that the functions be quasi-convex and locally Lipschitz.
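The normalized gradient descent (NGD) update analyzed in that line of work discards the gradient's magnitude and keeps only its direction, which is what allows the quasi-convex analysis to go through. A one-line sketch (without momentum; the `eps` safeguard is an assumption for numerical stability):

```python
import numpy as np

def ngd_step(w, grad, lr=0.05, eps=1e-8):
    """One normalized gradient descent step: move a fixed distance lr along
    the direction of the (stochastic) gradient, ignoring its magnitude."""
    return w - lr * grad / (np.linalg.norm(grad) + eps)
```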
• ICLR 2020 (Computer Science). It is shown that gradient smoothness, a concept central to the analysis of first-order optimization algorithms and often assumed to be constant, in fact varies significantly along the training trajectory of deep neural networks and positively correlates with the gradient norm, contrary to standard assumptions in the literature.
• WMT 2018 (Computer Science). This paper shows that reduced precision and large-batch training can speed up training by nearly 5x on a single 8-GPU machine with careful tuning and implementation.