• Corpus ID: 54442348

Towards Theoretical Understanding of Large Batch Training in Stochastic Gradient Descent

Xiaowu Dai and Yuhua Zhu
Stochastic gradient descent (SGD) is almost ubiquitously used for training non-convex optimization tasks. Recently, a hypothesis proposed by Keskar et al. [2017] that large batch methods tend to converge to sharp minimizers has received increasing attention. We theoretically justify this hypothesis by providing new properties of SGD in both finite-time and asymptotic regimes. In particular, we give an explicit escaping time of SGD from a local minimum in the finite-time regime and prove that… 


Stochastic Training is Not Necessary for Generalization
It is demonstrated that non-stochastic full-batch training can achieve strong performance on CIFAR-10 that is on par with SGD, using modern architectures in settings with and without data augmentation.
Study on the Large Batch Size Training of Neural Networks Based on the Second Order Gradient
A curvature-based learning rate (CBLR) algorithm is proposed to better fit the curvature variation across layers in a neural network, a sensitive factor affecting large batch size training, and the median-curvature LR algorithm is found to achieve performance comparable to the Layer-wise Adaptive Rate Scaling (LARS) algorithm.
Towards Explaining the Regularization Effect of Initial Large Learning Rate in Training Neural Networks
A setting is devised in which a two-layer network trained with a large initial learning rate and annealing provably generalizes better than the same network trained with a small learning rate from the start.
A consensus-based global optimization method for high dimensional machine learning problems
This work improves the recently introduced consensus-based optimization method, proposed in [R. Pinnau, C. Totzeck, O. Tse, S. Martin], by replacing the isotropic geometric Brownian motion with a component-wise one, thus removing the dimensionality dependence of the drift rate and making the method more competitive for high-dimensional optimization problems.
Beyond the Quadratic Approximation: the Multiscale Structure of Neural Network Loss Landscapes
This work studies the structure of neural network loss functions and its implications for optimization in a region beyond the reach of a good quadratic approximation, and proposes that the non-convexity of the models and the non-uniformity of the training data are among the causes.
The Multiscale Structure of Neural Network Loss Functions: The Effect on Optimization and Origin
This work studies the structure of neural network loss functions and its implications for optimization in a region beyond the reach of a good quadratic approximation, and proposes that the non-uniformity of the training data is one of its causes.
Structure Probing Neural Network Deflation
Scaling the Wild: Decentralizing Hogwild!-style Shared-memory SGD
This paper proposes an algorithm that combines a decentralized distributed-memory computing architecture with multiprocessing parallel shared-memory SGD running on each node, and proves that the method guarantees ergodic convergence rates for non-convex objectives.
Reproducing Activation Function for Deep Learning
The proposed reproducing activation function can facilitate the convergence of deep learning optimization for a solution with higher accuracy than existing deep learning solvers for audio/image/video reconstruction, PDEs, and eigenvalue problems.


On Large-Batch Training for Deep Learning: Generalization Gap and Sharp Minima
This work investigates the cause for this generalization drop in the large-batch regime and presents numerical evidence that supports the view that large-batch methods tend to converge to sharp minimizers of the training and testing functions - and as is well known, sharp minima lead to poorer generalization.
Stochastic Gradient Descent Performs Variational Inference, Converges to Limit Cycles for Deep Networks
It is proved that SGD minimizes an average potential over the posterior distribution of weights along with an entropic regularization term, and that the most likely trajectories of SGD for deep networks do not behave like Brownian motion around critical points, but resemble closed loops with deterministic components.
Train longer, generalize better: closing the generalization gap in large batch training of neural networks
This work proposes a "random walk on random landscape" statistical model which is known to exhibit similar "ultra-slow" diffusion behavior and presents a novel algorithm named "Ghost Batch Normalization" which enables significant decrease in the generalization gap without increasing the number of updates.
Three Factors Influencing Minima in SGD
Through this analysis, it is found that three factors – learning rate, batch size and the variance of the loss gradients – control the trade-off between the depth and width of the minima found by SGD, with wider minima favoured by a higher ratio of learning rate to batch size.
Non-convex learning via Stochastic Gradient Langevin Dynamics: a nonasymptotic analysis
The present work provides a nonasymptotic analysis in the context of non-convex learning problems, giving finite-time guarantees for SGLD to find approximate minimizers of both empirical and population risks.
Sharp Minima Can Generalize For Deep Nets
It is argued that most notions of flatness are problematic for deep models and cannot be directly applied to explain generalization, and when focusing on deep networks with rectifier units, the particular geometry of parameter space induced by the inherent symmetries that these architectures exhibit is exploited.
Stochastic Gradient Descent as Approximate Bayesian Inference
It is demonstrated that constant SGD gives rise to a new variational EM algorithm that optimizes hyperparameters in complex probabilistic models, and a scalable approximate MCMC algorithm, the Averaged Stochastic Gradient Sampler, is proposed.
Towards Understanding Generalization of Deep Learning: Perspective of Loss Landscapes
The underlying reasons why deep neural networks often generalize well are investigated, and it is shown that the volume of basin of attraction of good minima dominates over that of poor minima for the landscape of loss function for deep networks.
Computing Nonvacuous Generalization Bounds for Deep (Stochastic) Neural Networks with Many More Parameters than Training Data
By optimizing the PAC-Bayes bound directly, this work extends the approach of Langford and Caruana (2001) and obtains nonvacuous generalization bounds for deep stochastic neural network classifiers with millions of parameters trained on only tens of thousands of examples.
Understanding deep learning requires rethinking generalization
These experiments establish that state-of-the-art convolutional networks for image classification trained with stochastic gradient methods easily fit a random labeling of the training data, and confirm that simple depth two neural networks already have perfect finite sample expressivity.