Corpus ID: 220831128

Stochastic Normalized Gradient Descent with Momentum for Large Batch Training

Shen-Yi Zhao, Yin-Peng Xie, Wu-Jun Li
Stochastic gradient descent (SGD) and its variants have been the dominating optimization methods in machine learning. Compared with small batch training, SGD with large batch training can better utilize the computational power of current multi-core systems like GPUs and can reduce the number of communication rounds in distributed training. Hence, SGD with large batch training has attracted more and more attention. However, existing empirical results show that large batch training typically… 


Revisiting Outer Optimization in Adversarial Training

The convergence rate of ENGM is proved to be independent of the variance of the gradients, which makes it suitable for adversarial training (AT); it also alleviates major shortcomings of AT, including robust overfitting and high sensitivity to hyperparameter settings.

Descending through a Crowded Valley - Benchmarking Deep Learning Optimizers

An extensive, standardized benchmark of more than a dozen particularly popular deep learning optimizers is performed, identifying a significantly reduced subset of specific algorithms and parameter choices that generally provided competitive results in the authors' experiments.



On Large-Batch Training for Deep Learning: Generalization Gap and Sharp Minima

This work investigates the cause for this generalization drop in the large-batch regime and presents numerical evidence that supports the view that large-batch methods tend to converge to sharp minimizers of the training and testing functions - and as is well known, sharp minima lead to poorer generalization.

Scaling SGD Batch Size to 32K for ImageNet Training

Layer-wise Adaptive Rate Scaling (LARS) is proposed, a method that enables large-batch training for general networks or datasets; it can scale the batch size to 32768 for ResNet50 and 8192 for AlexNet.

Train longer, generalize better: closing the generalization gap in large batch training of neural networks

This work proposes a "random walk on random landscape" statistical model which is known to exhibit similar "ultra-slow" diffusion behavior and presents a novel algorithm named "Ghost Batch Normalization" which enables a significant decrease in the generalization gap without increasing the number of updates.

Don't Use Large Mini-Batches, Use Local SGD

This work proposes post-local SGD and shows that it significantly improves the generalization performance compared to large-batch training on standard benchmarks while enjoying the same efficiency and scalability.

Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour

This paper empirically shows that on the ImageNet dataset large minibatches cause optimization difficulties, but when these are addressed the trained networks exhibit good generalization and enable training visual recognition models on internet-scale data with high efficiency.

Parallel Restarted SGD with Faster Convergence and Less Communication: Demystifying Why Model Averaging Works for Deep Learning

A thorough and rigorous theoretical study on why model averaging can work as well as parallel mini-batch SGD with significantly less communication overhead.

Stochastic Gradient Methods with Layer-wise Adaptive Moments for Training of Deep Networks

NovoGrad, an adaptive stochastic gradient descent method with layer-wise gradient normalization and decoupled weight decay, performs on par with or better than well-tuned SGD with momentum and Adam or AdamW in experiments on neural networks.

Beyond Convexity: Stochastic Quasi-Convex Optimization

This paper analyzes a stochastic version of NGD and proves its convergence to a global minimum for a wider class of functions: it requires the functions to be quasi-convex and locally-Lipschitz.

Why Gradient Clipping Accelerates Training: A Theoretical Justification for Adaptivity

It is shown that gradient smoothness, a concept central to the analysis of first-order optimization algorithms that is often assumed to be a constant, demonstrates significant variability along the training trajectory of deep neural networks and positively correlates with the gradient norm, contrary to standard assumptions in the literature.

Scaling Neural Machine Translation

This paper shows that reduced precision and large batch training can speed up training by nearly 5x on a single 8-GPU machine with careful tuning and implementation.