Corpus ID: 235417608

Noise and Fluctuation of Finite Learning Rate Stochastic Gradient Descent

@inproceedings{Liu2020NoiseAF,
  title={Noise and Fluctuation of Finite Learning Rate Stochastic Gradient Descent},
  author={Kangqiao Liu and Liu Ziyin and Masahito Ueda},
  booktitle={International Conference on Machine Learning},
  year={2020}
}
In the vanishing learning rate regime, stochastic gradient descent (SGD) is now relatively well understood. In this work, we propose to study the basic properties of SGD and its variants in the non-vanishing learning rate regime. The focus is on deriving exactly solvable results and discussing their implications. The main contribution of this work is to derive the stationary distribution of discrete-time SGD on a quadratic loss, with and without momentum; in particular, one… 
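
As a concrete illustration of the regime the abstract describes, the following is a minimal simulation sketch, assuming a one-dimensional quadratic loss L(θ) = kθ²/2 and i.i.d. Gaussian gradient noise of variance σ² standing in for minibatch noise; the closed-form variance used for comparison is the standard result for this linear recursion at a finite learning rate, not a quotation of the paper's general formula.

import numpy as np

# Sketch: SGD with a finite (non-vanishing) learning rate on a 1D quadratic
# loss L(theta) = 0.5 * k * theta**2, with i.i.d. Gaussian gradient noise of
# variance sigma2 standing in for minibatch noise (an assumption).
rng = np.random.default_rng(0)
k, eta, sigma2 = 1.0, 0.5, 0.04   # curvature, learning rate, noise variance
n_steps, burn_in = 200_000, 10_000

theta, samples = 1.0, []
for t in range(n_steps):
    grad = k * theta + rng.normal(0.0, np.sqrt(sigma2))  # noisy gradient
    theta -= eta * grad                                   # SGD update
    if t >= burn_in:
        samples.append(theta)

# Stationary variance of the discrete-time recursion
#   theta_{t+1} = (1 - eta*k) * theta_t - eta * eps_t   (valid for 0 < eta*k < 2):
#   Var = eta * sigma2 / (k * (2 - eta*k))
# As eta -> 0 this reduces to the familiar continuous-time value eta*sigma2/(2k).
predicted = eta * sigma2 / (k * (2.0 - eta * k))
print(f"empirical variance : {np.var(samples):.5f}")
print(f"predicted variance : {predicted:.5f}")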

Citations

Strength of Minibatch Noise in SGD

This work presents the first systematic study of the SGD noise and fluctuations close to a local minimum and suggests that a large learning rate can help generalization by introducing an implicit regularization.

SGD with a Constant Large Learning Rate Can Converge to Local Maxima

This work constructs worst-case optimization problems illustrating that, when not in the regimes that the previous works often assume, SGD can exhibit many strange and potentially undesirable behaviors.

Two Facets of SDE Under an Information-Theoretic Lens: Generalization of SGD via Training Trajectories and via Terminal States

The PAC-Bayes-like information-theoretic bounds developed in Xu & Raginsky (2017) and Negrea et al. (2019) are applied to obtain generalization upper bounds in terms of the KL divergence between the steady-state weight distribution of SGD and a prior distribution, which suggests that the generalization of SGD is related to its stability.

Statistical Inference with Stochastic Gradient Algorithms

A Bernstein–von Mises-like theorem is proved to guide tuning, including for generalized posteriors that are robust to model misspecification, and iterate averaging with a large step size is shown to be robust to the choice of tuning parameters.

Power-Law Escape Rate of SGD

It is shown that the log loss barrier between a local minimum θ* and a saddle θ_s determines the escape rate of SGD from the local minimum, contrary to previous results borrowed from physics in which the linear loss barrier decides the escape rate.

SGD May Never Escape Saddle Points

The result suggests that the noise structure of SGD might be more important than the loss landscape in neural network training and that future research should focus on deriving the actual noise structure in deep learning.

Universal Thermodynamic Uncertainty Relation in Non-Equilibrium Dynamics

We derive a universal thermodynamic uncertainty relation (TUR) that applies to an arbitrary observable in a general Markovian system. The generality of our result allows us to make two findings: (1)

Exact Solutions of a Deep Linear Network

The analytical expression for the global minima of a deep linear network with weight decay and stochastic neurons, a fundamental model for understanding the landscape of neural networks, implies that zero is a special point in deep neural network architectures.

Stochastic Neural Networks with Infinite Width are Deterministic

It is proved that as the width of an optimized stochastic neural network tends to infinity, its predictive variance on the training set decreases to zero, helping to better understand how stochasticity affects the learning of neural networks and potentially to design better architectures for practical problems.

References

Showing 1–10 of 62 references

A Diffusion Theory For Deep Learning Dynamics: Stochastic Gradient Descent Exponentially Favors Flat Minima

This work develops a density diffusion theory (DDT) to reveal how minima selection quantitatively depends on minima sharpness and the hyperparameters, and is the first to theoretically and empirically prove that, benefiting from the Hessian-dependent covariance of stochastic gradient noise, SGD favors flat minima exponentially more than sharp minima.

The Anisotropic Noise in Stochastic Gradient Descent: Its Behavior of Escaping from Sharp Minima and Regularization Effects

This work studies a general form of gradient-based optimization dynamics with unbiased noise, which unifies SGD and standard Langevin dynamics, and shows that the anisotropic noise in SGD helps to escape from sharp and poor minima effectively, towards more stable and flat minima that typically generalize well.

Stochastic Gradient Descent as Approximate Bayesian Inference

It is demonstrated that constant SGD gives rise to a new variational EM algorithm that optimizes hyperparameters in complex probabilistic models, and a scalable approximate MCMC algorithm, the Averaged Stochastic Gradient Sampler, is proposed.

Stochastic Processes in Physics and Chemistry

N. G. van Kampen, Amsterdam: North-Holland, 1981, xiv + 419 pp.

Three Factors Influencing Minima in SGD

Through this analysis, it is found that three factors – learning rate, batch size and the variance of the loss gradients – control the trade-off between the depth and width of the minima found by SGD, with wider minima favoured by a higher ratio of learning rate to batch size.
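
A schematic of the continuous-time picture commonly used to motivate this learning-rate-to-batch-size ratio (stated here as a standard modeling assumption rather than as the cited paper's own derivation): if the per-sample gradient covariance is \Sigma(\theta) and the batch size is B, minibatch SGD with learning rate \eta is approximated by the stochastic differential equation

\[
  d\theta_t = -\nabla L(\theta_t)\, dt + \sqrt{\frac{\eta}{B}\,\Sigma(\theta_t)}\; dW_t ,
\]

so the effective noise scale is set by the ratio \eta / B; increasing the learning rate or decreasing the batch size amplifies the fluctuations and biases SGD toward wider minima.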

Stochastic Gradient Descent Performs Variational Inference, Converges to Limit Cycles for Deep Networks

It is proved that SGD minimizes an average potential over the posterior distribution of weights along with an entropic regularization term, and that the most likely trajectories of SGD for deep networks do not behave like Brownian motion around critical points, but resemble closed loops with deterministic components.

A Bayesian Perspective on Generalization and Stochastic Gradient Descent

It is proposed that the noise introduced by small mini-batches drives the parameters towards minima whose evidence is large, and it is demonstrated that, when one holds the learning rate fixed, there is an optimum batch size which maximizes the test set accuracy.

On Large-Batch Training for Deep Learning: Generalization Gap and Sharp Minima

This work investigates the cause for this generalization drop in the large-batch regime and presents numerical evidence that supports the view that large-batch methods tend to converge to sharp minimizers of the training and testing functions, and, as is well known, sharp minima lead to poorer generalization.
...