On Large Batch Training and Sharp Minima: A Fokker–Planck Perspective

  • Xiaowu Dai, Yuhua Zhu
  • Published 26 July 2020
  • Computer Science
  • Journal of Statistical Theory and Practice
We study the statistical properties of the dynamic trajectory of stochastic gradient descent (SGD). We approximate the mini-batch SGD and the momentum SGD as stochastic differential equations. We exploit the continuous formulation of SDE and the theory of Fokker–Planck equations to develop new results on the escaping phenomenon and the relationship with large batch and sharp minima. In particular, we find that the stochastic process solution tends to converge to flatter minima regardless of the… 
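
To make the SDE approximation of mini-batch SGD concrete, the following is a minimal sketch, not the authors' construction. It assumes a one-dimensional parameter and gradient noise with a constant standard deviation `sigma`; under those assumptions one SGD step with learning rate `lr` and batch size `batch_size` matches one Euler–Maruyama step of d(theta) = -grad(theta) dt + sqrt(lr / batch_size) * sigma dW with time step dt = lr. The function name and all parameters are hypothetical.

```python
import numpy as np


def simulate_sgd_sde(grad, theta0, lr, batch_size, sigma, n_steps, seed=0):
    """Euler-Maruyama path of d(theta) = -grad(theta) dt + sqrt(lr/B) * sigma dW.

    Assumption (not from the source): the gradient noise is Gaussian with a
    constant standard deviation `sigma`, and the time step is dt = lr.
    """
    rng = np.random.default_rng(seed)
    dt = lr
    noise_scale = np.sqrt(lr / batch_size) * sigma  # diffusion coefficient
    theta = np.empty(n_steps + 1)
    theta[0] = theta0
    for k in range(n_steps):
        dw = rng.normal(0.0, np.sqrt(dt))  # Brownian increment over dt
        theta[k + 1] = theta[k] - grad(theta[k]) * dt + noise_scale * dw
    return theta


# Usage: a quadratic well f(theta) = 0.5 * theta**2, so grad(theta) = theta.
small_batch = simulate_sgd_sde(lambda t: t, 0.0, lr=0.1, batch_size=8,
                               sigma=1.0, n_steps=5000)
large_batch = simulate_sgd_sde(lambda t: t, 0.0, lr=0.1, batch_size=512,
                               sigma=1.0, n_steps=5000)
print(small_batch.var(), large_batch.var())
```

In this toy setting the small-batch path fluctuates more widely around the minimum than the large-batch path, because the diffusion coefficient scales with sqrt(lr / batch_size).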

A sharp convergence rate for a model equation of the asynchronous stochastic gradient descent

It is proved that when the number of local workers is larger than the expected staleness, ASGD is more efficient than stochastic gradient descent, and the theoretical result suggests that longer delays result in a slower convergence rate.

Bayesian mechanics for stationary processes

It follows that active states can be seen as performing active inference and well-known forms of stochastic control, which are prominent formulations of adaptive behaviour in theoretical biology and engineering.



Three Factors Influencing Minima in SGD

Through this analysis, it is found that three factors – learning rate, batch size and the variance of the loss gradients – control the trade-off between the depth and width of the minima found by SGD, with wider minima favoured by a higher ratio of learning rate to batch size.
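
The role of the ratio of learning rate to batch size can be illustrated with a small worked formula. This is a sketch under stated assumptions, not the cited paper's derivation: a one-dimensional quadratic minimum with curvature `curvature`, constant gradient-noise standard deviation `sigma`, and the continuous-time (Ornstein–Uhlenbeck) approximation of SGD, whose stationary variance is lr * sigma**2 / (2 * batch_size * curvature). The function name is hypothetical.

```python
import math


def stationary_variance(lr, batch_size, sigma, curvature):
    """Stationary variance of d(theta) = -a*theta dt + sqrt(lr/B)*sigma dW.

    Assumptions (not from the source): 1-D quadratic loss with curvature a,
    constant noise level sigma, continuous-time approximation.
    """
    return lr * sigma ** 2 / (2.0 * batch_size * curvature)


# Doubling both the learning rate and the batch size keeps lr / batch_size,
# and therefore the spread of the iterates around the minimum, unchanged.
base = stationary_variance(lr=0.1, batch_size=32, sigma=1.0, curvature=1.0)
scaled = stationary_variance(lr=0.2, batch_size=64, sigma=1.0, curvature=1.0)
print(math.isclose(base, scaled))
```

Under these assumptions, the spread depends on the learning rate and batch size only through their ratio, and it is larger in flatter minima (smaller `curvature`).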

On Large-Batch Training for Deep Learning: Generalization Gap and Sharp Minima

This work investigates the cause of the generalization drop in the large-batch regime and presents numerical evidence that supports the view that large-batch methods tend to converge to sharp minimizers of the training and testing functions, and, as is well known, sharp minima lead to poorer generalization.

Stochastic Gradient Descent Performs Variational Inference, Converges to Limit Cycles for Deep Networks

It is proved that SGD minimizes an average potential over the posterior distribution of weights along with an entropic regularization term, and that the most likely trajectories of SGD for deep networks do not behave like Brownian motion around critical points, but resemble closed loops with deterministic components.

A Bayesian Perspective on Generalization and Stochastic Gradient Descent

It is proposed that the noise introduced by small mini-batches drives the parameters towards minima whose evidence is large, and it is demonstrated that, when one holds the learning rate fixed, there is an optimum batch size which maximizes the test set accuracy.

Deep relaxation: partial differential equations for optimizing deep neural networks

Stochastic homogenization theory allows us to better understand the convergence of the algorithm, and a stochastic control interpretation is used to prove that a modified algorithm converges faster than SGD in expectation.

Train longer, generalize better: closing the generalization gap in large batch training of neural networks

This work proposes a "random walk on random landscape" statistical model, which is known to exhibit similar "ultra-slow" diffusion behavior, and presents a novel algorithm named "Ghost Batch Normalization", which enables a significant decrease in the generalization gap without increasing the number of updates.

Stochastic modified equations for the asynchronous stochastic gradient descent

An optimal mini-batching strategy for ASGD via solving the optimal control problem of the associated SME is proposed and the convergence of ASGD to the SME in the continuous time limit is shown.

Stochastic Gradient Descent as Approximate Bayesian Inference

It is demonstrated that constant SGD gives rise to a new variational EM algorithm that optimizes hyperparameters in complex probabilistic models, and a scalable approximate MCMC algorithm, the Averaged Stochastic Gradient Sampler, is proposed.

On the momentum term in gradient descent learning algorithms

  • N. Qian
  • Physics, Computer Science
  • Neural Networks
  • 1999

On the importance of initialization and momentum in deep learning

It is shown that when stochastic gradient descent with momentum uses a well-designed random initialization and a particular type of slowly increasing schedule for the momentum parameter, it can train both DNNs and RNNs to levels of performance that were previously achievable only with Hessian-Free optimization.
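
As an illustration of momentum with a slowly increasing momentum parameter, here is a minimal sketch of the classical (heavy-ball) update on a one-dimensional problem. The schedule `momentum_schedule`, its constants `mu_max` and `ramp`, and the function names are hypothetical choices for this example and are not claimed to be the cited paper's exact settings; gradients are deterministic here to keep the example short.

```python
def momentum_schedule(step, mu_max=0.99, ramp=250):
    """Hypothetical schedule: momentum rises in stages from 0.5 toward mu_max."""
    return min(1.0 - 0.5 / (step // ramp + 1), mu_max)


def momentum_descent(grad, theta0, lr, n_steps):
    """Classical momentum: v <- mu * v - lr * grad(theta); theta <- theta + v."""
    theta, velocity = float(theta0), 0.0
    for step in range(n_steps):
        mu = momentum_schedule(step)
        velocity = mu * velocity - lr * grad(theta)
        theta += velocity
    return theta


# Usage: minimize f(theta) = 0.5 * theta**2, whose gradient is theta.
final_theta = momentum_descent(lambda t: t, theta0=1.0, lr=0.01, n_steps=2000)
print(final_theta)
```

The velocity accumulates past gradients, so a larger momentum parameter late in training gives each gradient a longer-lasting effect on the iterate.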