The effective noise of stochastic gradient descent

@article{Mignacco2021TheEN,
  title={The effective noise of stochastic gradient descent},
  author={Francesca Mignacco and Pierfrancesco Urbani},
  journal={Journal of Statistical Mechanics: Theory and Experiment},
  year={2022},
  volume={2022}
}
Stochastic gradient descent (SGD) is the workhorse algorithm of deep learning technology. At each step of the training phase, a mini-batch of samples is drawn from the training dataset and the weights of the neural network are adjusted according to the performance on this specific subset of examples. The mini-batch sampling procedure introduces stochastic dynamics into the gradient descent, with a non-trivial, state-dependent noise. We characterize the stochasticity of SGD and a recently…
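
To make the mini-batch noise concrete, here is a minimal sketch (illustrative only, not the paper's code): it runs SGD on a toy logistic-regression problem and records, at each step, the gap between the mini-batch gradient and the full-batch gradient, which is the state-dependent noise the abstract refers to. All names and parameter values are hypothetical.

```python
# Minimal sketch: mini-batch SGD on a toy logistic-regression loss, recording the
# "effective noise" as the gap between mini-batch and full-batch gradients.
import numpy as np

rng = np.random.default_rng(0)
n, d, batch_size, lr, steps = 1000, 50, 32, 0.1, 200

X = rng.standard_normal((n, d))
y = np.sign(X @ rng.standard_normal(d))          # teacher labels in {-1, +1}

def mean_gradient(w, X, y):
    """Gradient of the mean logistic loss over the given samples."""
    margins = y * (X @ w)
    return -(y / (1.0 + np.exp(margins))) @ X / len(y)

w = np.zeros(d)
noise_norms = []
for t in range(steps):
    idx = rng.choice(n, size=batch_size, replace=False)   # draw a mini-batch
    g_batch = mean_gradient(w, X[idx], y[idx])
    g_full = mean_gradient(w, X, y)
    noise_norms.append(np.linalg.norm(g_batch - g_full))  # state-dependent noise
    w -= lr * g_batch                                      # SGD update

print(f"mean noise norm over training: {np.mean(noise_norms):.3f}")
```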

Dynamical Mean Field Theory of Kernel Evolution in Wide Neural Networks

A collection of deterministic dynamical order parameters, namely inner-product kernels for hidden-unit activations and gradients in each layer at pairs of time points, is constructed, providing a reduced description of network activity throughout training.
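
As an illustration of one such order parameter, the sketch below (assumptions: a tiny two-layer tanh network trained by full-batch gradient descent in numpy; this is not the paper's DMFT machinery) records the hidden-unit activations on a probe input at every training step and assembles the inner-product kernel over pairs of time points. Only the activation kernel is tracked here; the gradient kernels are omitted for brevity.

```python
# Minimal sketch: record the inner-product kernel of hidden activations at pairs of
# training times, Phi[t, t'] = (1/N) * h(t) . h(t'), for a single probe input.
import numpy as np

rng = np.random.default_rng(1)
d, N, steps, lr = 10, 200, 50, 0.05

x = rng.standard_normal(d) / np.sqrt(d)          # probe input
X = rng.standard_normal((100, d))                # training inputs
y = rng.standard_normal(100)                     # training targets

W = rng.standard_normal((N, d)) / np.sqrt(d)     # first-layer weights
a = rng.standard_normal(N) / np.sqrt(N)          # readout weights

activations = []
for t in range(steps):
    H = np.tanh(X @ W.T)                          # hidden activations, (samples, N)
    err = H @ a - y
    grad_a = H.T @ err / len(y)                   # gradients of the mean squared error
    grad_W = ((err[:, None] * (1 - H**2) * a).T @ X) / len(y)
    a -= lr * grad_a
    W -= lr * grad_W
    activations.append(np.tanh(W @ x))            # hidden activations on the probe

# kernel over all pairs of time points
Phi = np.array([[h1 @ h2 / N for h2 in activations] for h1 in activations])
print("Phi shape:", Phi.shape, " Phi[0, 0]:", round(Phi[0, 0], 3))
```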

Rigorous dynamical mean field theory for stochastic gradient descent methods

Closed-form equations are proved for the exact high-dimensional asymptotics of a family of first-order gradient-based methods that learn an estimator from observations of Gaussian data via empirical risk minimization.
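
For a concrete picture of the kind of dynamics such asymptotics summarize, here is a minimal sketch (illustrative only, not the paper's derivation or equations): gradient descent on a ridge-regularized least-squares problem with Gaussian data, tracking two scalar order parameters, the overlap with the teacher m(t) = w · w* and the self-overlap q(t) = |w|². All values are hypothetical.

```python
# Minimal sketch: gradient-descent ERM on Gaussian data, tracking scalar order
# parameters of the kind that closed-form high-dimensional asymptotics describe.
import numpy as np

rng = np.random.default_rng(2)
d, n, lr, lam, steps = 200, 600, 0.2, 0.01, 101

w_star = rng.standard_normal(d)
w_star /= np.linalg.norm(w_star)                 # unit-norm teacher vector
X = rng.standard_normal((n, d))                  # Gaussian data matrix
y = X @ w_star + 0.1 * rng.standard_normal(n)    # noisy observations

w = np.zeros(d)
for t in range(steps):
    grad = X.T @ (X @ w - y) / n + lam * w       # empirical-risk gradient
    w -= lr * grad
    if t % 25 == 0:
        print(f"t={t:3d}  m={w @ w_star:.3f}  q={w @ w:.3f}")
```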

The high-d landscapes paradigm: spin-glasses, and beyond

This Chapter focuses in particular on the problem of characterizing the landscape topology and geometry, discussing techniques to count and classify its stationary points and stressing connections with the statistical physics of disordered systems and with random matrix theory.

Subaging in underparametrized deep neural networks

We consider a simple classification problem to show that the dynamics of finite-width deep neural networks in the underparametrized regime give rise to effects similar to those associated with…

Self-Consistent Dynamical Field Theory of Kernel Evolution in Wide Neural Networks

Comparisons of the self-consistent solution to various approximation schemes, including the static NTK approximation, the gradient-independence assumption, and leading-order perturbation theory, are provided, showing that each of these approximations can break down in regimes where the general self-consistent solution still provides an accurate description.

References

SHOWING 1-10 OF 56 REFERENCES

Journal of Physics A: Mathematical and Theoretical 44, 483001 (2011)

Stochasticity helps to navigate rough landscapes: comparing gradient-descent-based algorithms in the phase retrieval problem

Dynamical mean-field theory from statistical physics is applied to characterize analytically the full trajectories of gradient-based algorithms in their continuous-time limit, with a warm start and for large system sizes, unveiling several intriguing properties of the landscape and of the algorithms.
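
As a concrete instance of the setting described above, here is a minimal sketch (an assumed setup, not the paper's DMFT analysis): gradient descent on a real-valued phase-retrieval loss, started from a "warm" initialization with a positive overlap with the hidden signal. All parameter values are hypothetical.

```python
# Minimal sketch: gradient descent for phase retrieval with a warm start,
# loss L(w) = mean(((x . w)^2 - y)^2) / 4 over Gaussian sensing vectors x.
import numpy as np

rng = np.random.default_rng(3)
d, n, lr, steps, overlap0 = 200, 1200, 0.01, 500, 0.3

w_star = rng.standard_normal(d)
w_star /= np.linalg.norm(w_star)                   # unit-norm hidden signal
X = rng.standard_normal((n, d))                    # Gaussian sensing vectors
y = (X @ w_star) ** 2                              # intensity-only measurements

# warm start: unit-norm mixture of the signal and an independent random direction
noise = rng.standard_normal(d)
noise /= np.linalg.norm(noise)
w = overlap0 * w_star + np.sqrt(1.0 - overlap0**2) * noise

for t in range(steps):
    p = X @ w
    grad = X.T @ ((p**2 - y) * p) / n              # gradient of the quartic loss
    w -= lr * grad

overlap = abs(w @ w_star) / np.linalg.norm(w)
print(f"final overlap with the signal: {overlap:.3f}")
```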

The inverse variance–flatness relation in stochastic gradient descent is critical for finding flat minima

  • Yu Feng, Y. Tu
  • Computer Science
    Proceedings of the National Academy of Sciences
  • 2021
By analyzing SGD-based learning dynamics together with the loss-function landscape, a robust inverse relation between weight fluctuation and loss-landscape flatness is discovered, opposite to the fluctuation–dissipation relation in physics.
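
For reference, the toy sketch below (not the paper's protocol) shows the baseline fluctuation–dissipation behaviour that the measured relation runs counter to: with state-independent Gaussian noise added to gradient descent on a quadratic loss, flatter directions acquire larger weight fluctuations, whereas the cited work reports the opposite trend for SGD. All parameters are hypothetical.

```python
# Minimal sketch: noisy gradient descent near a quadratic minimum, comparing the
# stationary variance of the weights in each direction with that direction's curvature.
import numpy as np

rng = np.random.default_rng(4)
lr, batch, steps = 0.05, 8, 5000
curvatures = np.array([0.5, 1.0, 2.0, 4.0, 8.0])   # loss = 0.5 * sum_i k_i * w_i^2

w = np.zeros(len(curvatures))
samples = []
for t in range(steps):
    # noise modelled as state-independent Gaussian with a mini-batch-like scale
    grad = curvatures * w + rng.standard_normal(len(w)) / np.sqrt(batch)
    w -= lr * grad
    if t > 1000:
        samples.append(w.copy())

variances = np.var(np.array(samples), axis=0)
for k, v in zip(curvatures, variances):
    print(f"curvature {k:4.1f}  ->  weight variance {v:.5f}")
# In this toy model the variance scales like 1/curvature: flatter directions
# fluctuate more, which is the fluctuation-dissipation baseline.
```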

How to study a persistent active glassy system

A recently proposed scheme that allows one to study directly the dynamics in the large persistence time limit, on timescales around and well above the persistence time, is described.

Understanding deep learning is also a job for physicists

A physics-based approach to automated learning from data by means of deep neural networks may help to bridge the gap between theoretical and practical applications.

Theory of Simple Glasses: Exact Solutions in Infinite Dimensions

This pedagogical and self-contained text describes the modern mean field theory of simple structural glasses. The book begins with a thorough explanation of infinite-dimensional models in statistical…

Poly-time universality and limitations of deep learning

SGD is shown to be universal even with some polynomial noise, while full GD or SQ algorithms are not (e.g., on parities); this also gives a separation between SGD-based deep learning and statistical-query algorithms.
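
For context, the parity task mentioned above can be generated in a few lines. The sketch below (illustrative only, names hypothetical) builds a k-sparse parity dataset, the standard hard case for statistical-query and full-gradient methods referenced in the summary.

```python
# Minimal sketch: a k-sparse parity dataset over {-1, +1}^d.
import numpy as np

rng = np.random.default_rng(5)
n, d, k = 1000, 30, 3
support = rng.choice(d, size=k, replace=False)    # hidden relevant coordinates

X = rng.choice([-1, 1], size=(n, d))
y = np.prod(X[:, support], axis=1)                # label = parity of the k hidden bits

print("relevant coordinates:", support, " first labels:", y[:5])
```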

Force balance controls the relaxation time of the gradient descent algorithm in the satisfiable phase.

The relaxation dynamics of the single-layer perceptron with a spherical constraint is numerically studied, and the estimated critical exponent of the relaxation time in the nonconvex region agrees very well with that of frictionless spherical particles, which have been studied in the context of the jamming transition of granular materials.
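
As a concrete instance of the setup described above, here is a minimal sketch (assumed details, not the paper's code): gradient descent for a single-layer perceptron under the spherical constraint |w|² = d, enforced by projecting back onto the sphere after each step, run in the satisfiable regime n/d < 2 where random labels are typically separable.

```python
# Minimal sketch: projected gradient descent for a spherical perceptron with a
# harmonic (quadratic) loss on unsatisfied constraints.
import numpy as np

rng = np.random.default_rng(6)
d, n, lr, steps = 100, 150, 0.1, 1000

X = rng.standard_normal((n, d)) / np.sqrt(d)      # patterns with |x| ~ 1
y = rng.choice([-1, 1], size=n)                   # random labels

w = rng.standard_normal(d)
w *= np.sqrt(d) / np.linalg.norm(w)               # start on the sphere

for t in range(steps):
    margins = y * (X @ w)
    active = margins < 0                          # unsatisfied constraints ("contacts")
    # gradient of the harmonic loss 0.5 * sum of margin^2 over unsatisfied constraints
    grad = (margins[active] * y[active]) @ X[active]
    w -= lr * grad
    w *= np.sqrt(d) / np.linalg.norm(w)           # project back onto the sphere

print(f"fraction of satisfied constraints: {(y * (X @ w) > 0).mean():.3f}")
```
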
...