Conservative SPDEs as fluctuating mean field limits of stochastic gradient descent

@article{Gess2022ConservativeSA,
  title={Conservative SPDEs as fluctuating mean field limits of stochastic gradient descent},
  author={Benjamin Gess and Rishabh S Gvalani and Vitalii Konarovskyi},
  journal={ArXiv},
  year={2022},
  volume={abs/2207.05705}
}
The convergence of stochastic interacting particle systems in the mean-field limit to solutions of conservative stochastic partial differential equations is shown, with optimal rate of convergence. As a second main result, a quantitative central limit theorem for such SPDEs is derived, again with optimal rate of convergence. The results apply in particular to the convergence in the mean-field scaling of stochastic gradient descent dynamics in overparametrized, shallow neural networks to…
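
To make the mean-field scaling in the abstract concrete, the following sketch (purely illustrative; the network, hyperparameters, and scaling convention are assumptions, not taken from the paper) runs mini-batch SGD on an overparametrized shallow network and records the empirical measure of its parameters, which is the object whose limit and fluctuations are studied.

import numpy as np

# Illustrative sketch: mean-field scaling of SGD for a shallow network
# f(x) = (1/N) * sum_i c_i * tanh(a_i * x).  The object of interest is the
# empirical measure nu_N(t) = (1/N) * sum_i delta_{(c_i, a_i)} of the parameters.

rng = np.random.default_rng(0)

N = 1000                     # number of neurons (overparametrized regime)
eta = 0.1                    # learning rate
steps = 2000
batch = 32

target = lambda x: np.sin(2 * np.pi * x)          # toy regression target
c = rng.normal(size=N)                            # outer weights
a = rng.normal(size=N)                            # inner weights

def predict(x):
    # x: (batch,) -> predictions (batch,)
    return np.tanh(np.outer(x, a)) @ c / N

for _ in range(steps):
    x = rng.uniform(-1.0, 1.0, size=batch)
    err = predict(x) - target(x)                  # residuals, shape (batch,)
    h = np.tanh(np.outer(x, a))                   # hidden activations, (batch, N)
    # gradients of the empirical squared loss w.r.t. c and a
    grad_c = h.T @ err / (batch * N)
    grad_a = ((1 - h**2) * c).T @ (err * x) / (batch * N)
    # mean-field scaling: the eta * N step compensates the 1/N in the gradient,
    # so each neuron (particle) moves at an O(eta) rate
    c -= eta * N * grad_c
    a -= eta * N * grad_a

# summary of the empirical measure of the outer weights after training
hist, edges = np.histogram(c, bins=30, density=True)
print(hist)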


LDP and CLT for SPDEs with Transport Noise

In this work we consider solutions to stochastic partial differential equations with transport noise, which are known to converge, in a suitable scaling limit, to solutions of the corresponding…
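
As a schematic of what "transport noise" means here (a generic form, not necessarily the one used in that work), such SPDEs read

$d u_t = \Delta u_t\,dt + \sum_{k} \sigma_k \cdot \nabla u_t \circ dW^k_t,$

with prescribed (typically divergence-free) vector fields $\sigma_k$ and independent Brownian motions $W^k$; in a suitable scaling limit of the noise coefficients, solutions are known to converge to those of a deterministic equation with an additional enhanced-diffusion term.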

References

Showing 1–10 of 93 references

The Implicit Regularization of Stochastic Gradient Flow for Least Squares

TLDR
The implicit regularization of mini-batch stochastic gradient descent, when applied to the fundamental problem of least squares regression, is studied, finding that the results hold under no conditions on the data matrix $X$ and across the entire optimization path.
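
As an illustration of the kind of statement involved (a heuristic sketch, not the paper's construction: the time–regularization pairing $\lambda \approx 1/t$ and all hyperparameters below are assumptions), one can compare mini-batch SGD iterates on least squares with points on the ridge regression path:

import numpy as np

# Heuristic sketch: mini-batch SGD on least squares versus the ridge path,
# pairing elapsed gradient-flow time t with regularization lambda ~ 1/t.

rng = np.random.default_rng(1)
n, d = 200, 50
X = rng.normal(size=(n, d))
beta_true = rng.normal(size=d)
y = X @ beta_true + 0.5 * rng.normal(size=n)

eta, batch = 1e-3, 10
beta = np.zeros(d)

def ridge(lam):
    # minimizer of (1/2n)||y - X b||^2 + (lam/2)||b||^2
    return np.linalg.solve(X.T @ X / n + lam * np.eye(d), X.T @ y / n)

for k in range(1, 5001):
    idx = rng.choice(n, size=batch, replace=False)
    grad = X[idx].T @ (X[idx] @ beta - y[idx]) / batch
    beta -= eta * grad
    if k % 1000 == 0:
        t = k * eta                       # elapsed "time" of the gradient flow
        lam = 1.0 / t                     # heuristic ridge parameter matched to t
        print(k, np.linalg.norm(beta - ridge(lam)))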

Remarks on uniqueness and strong solutions to deterministic and stochastic differential equations

Motivated by open problems of well-posedness in fluid dynamics, two topics related to strong solutions of SDEs are discussed. The first one, on stochastic flows for SDEs with non-regular drift, helps…

Stochastic nonlinear Fokker–Planck equations

Parameters as interacting particles: long time convergence and asymptotic error scaling of neural networks

TLDR
It is shown that in the limit that the number of parameters $n$ is large, the landscape of the mean-squared error becomes convex and the representation error in the function scales as $O(n^{-1})$.
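
The mean-field picture behind this result can be summarised in one formula: writing the network in the scaling

$f_n(x) = \frac{1}{n}\sum_{i=1}^{n} c_i\,\sigma(a_i \cdot x + b_i) \;\longrightarrow\; \int c\,\sigma(a \cdot x + b)\,\mu(dc\,da\,db) \quad (n\to\infty),$

the output becomes linear in the parameter distribution $\mu$, so the mean-squared error is a convex functional of $\mu$; the quoted $O(n^{-1})$ rate is the scaling of the representation error incurred by using only $n$ parameters.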

The Dean-Kawasaki equation and the structure of density fluctuations in systems of diffusing particles

TLDR
It is shown that structure-preserving discretisations of the Dean–Kawasaki equation may approximate the density fluctuations of $N$ noninteracting diffusing particles to arbitrary order in $N^{-1}$ (in suitable weak metrics).
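
For orientation, the standard form of the Dean–Kawasaki equation for the empirical density $\rho$ of $N$ independent Brownian particles (the normalisation is an assumption; conventions vary) is

$\partial_t \rho = \tfrac{1}{2}\Delta\rho + N^{-1/2}\,\nabla\cdot\big(\sqrt{\rho}\,\xi\big),$

with $\xi$ a vector-valued space–time white noise; the $N^{-1/2}$ prefactor makes the noise a fluctuation correction to the heat equation, which is what structure-preserving discretisations aim to capture.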

Well-posedness of the Dean--Kawasaki and the nonlinear Dawson--Watanabe equation with correlated noise

In this paper we prove the well-posedness of the generalized Dean–Kawasaki equation driven by noise that is white in time and colored in space. The results treat diffusion coefficients that…
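
Schematically (a generic form written here for orientation; the precise assumptions and nonlinearities are those of the cited paper), the correlated-noise generalisation replaces space–time white noise by noise that is white in time and coloured in space:

$\partial_t \rho = \Delta\Phi(\rho) + \nabla\cdot\big(\sigma(\rho)\circ\dot\xi^F\big), \qquad \dot\xi^F(t,x) = \sum_k f_k(x)\,\dot\beta^k(t),$

with spatial correlation kernels $f_k$ and independent Brownian motions $\beta^k$.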

Direction Matters: On the Implicit Bias of Stochastic Gradient Descent with Moderate Learning Rate

TLDR
The theory explains several folk arts used in practice for SGD hyperparameter tuning, such as linearly scaling the initial learning rate with batch size, and continuing to run SGD with a high learning rate even when the loss stops decreasing.
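
The linear scaling heuristic mentioned here is simple to state in code; the constants and helper below are illustrative, not from the cited paper:

# Linear scaling rule (folk heuristic): when the batch size is multiplied by k,
# multiply the initial learning rate by k as well, so the expected parameter
# change per epoch stays roughly constant.

BASE_BATCH_SIZE = 128
BASE_LR = 0.1

def scaled_learning_rate(batch_size: int) -> float:
    """Illustrative helper: scale the initial learning rate linearly with batch size."""
    return BASE_LR * batch_size / BASE_BATCH_SIZE

print(scaled_learning_rate(128))   # 0.1
print(scaled_learning_rate(1024))  # 0.8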

Clark representation formula for the solution to equation with interaction

This type of equation was introduced and studied by A. Dorogovtsev in [2]. Here we consider the measure-valued process $\mu_t$ as a functional of the noise $W(\cdot)$. It is a natural question for one to ask…
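
For orientation, the classical Clark(–Ocone) representation of a sufficiently regular Brownian functional $F$ reads

$F = \mathbb{E}[F] + \int_0^T \mathbb{E}\big[D_t F \mid \mathcal{F}_t\big]\,dW_t,$

with $D$ the Malliavin derivative and $(\mathcal{F}_t)$ the Brownian filtration; the cited work is concerned with an analogous representation for functionals of the measure-valued process $\mu_t$ driven by $W(\cdot)$.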

On the Validity of Modeling SGD with Stochastic Differential Equations (SDEs)

TLDR
An efficient simulation algorithm, SVAG, is proposed that provably converges to the conventionally used Itô SDE approximation, together with conditions for the SDE approximation and its most famous implication, the linear scaling rule, to hold.
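
The Itô SDE approximation referred to here is conventionally of the form $d\theta_t = -\nabla L(\theta_t)\,dt + \sqrt{\eta\,\Sigma(\theta_t)}\,dW_t$, with $\Sigma$ the minibatch-gradient covariance. The sketch below (a one-dimensional toy comparison under assumed Gaussian gradient noise, not the SVAG algorithm) contrasts SGD with an Euler–Maruyama discretisation of that SDE:

import numpy as np

# Illustrative 1-D comparison: SGD on the quadratic loss L(theta) = theta^2 / 2
# versus Euler-Maruyama for its conventional Ito SDE approximation
#   d theta = -L'(theta) dt + sqrt(eta * Sigma) dW_t,
# where Sigma is the variance of the minibatch gradient.

rng = np.random.default_rng(2)
eta, steps, sigma2 = 0.05, 2000, 0.25   # learning rate, iterations, gradient-noise variance

def grad(theta):                         # exact gradient of L
    return theta

# SGD with additive gradient noise of variance sigma2
theta_sgd = 1.0
for _ in range(steps):
    g = grad(theta_sgd) + np.sqrt(sigma2) * rng.normal()
    theta_sgd -= eta * g

# Euler-Maruyama for the SDE approximation, time step dt = eta
theta_sde = 1.0
for _ in range(steps):
    theta_sde += -grad(theta_sde) * eta + np.sqrt(eta * eta * sigma2) * rng.normal()

print(theta_sgd, theta_sde)   # both fluctuate around 0 with comparable variance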

Classifying high-dimensional Gaussian mixtures: Where kernel methods fail and neural networks succeed

TLDR
It is theoretically shown that two-layer neural networks (2LNN) with only a few neurons can beat the performance of kernel learning on a simple Gaussian mixture classification task, and it is illustrated how over-parametrising the neural network leads to faster convergence but does not improve its final performance.
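
As a toy version of such a task (illustrative only; the data model, network size, and training details are assumptions, not the paper's setup), a small two-layer network can be trained by SGD on an XOR-like Gaussian mixture:

import numpy as np

# Illustrative sketch: a small two-layer neural network (2LNN) trained by SGD on an
# XOR-like Gaussian mixture, a task where a handful of adaptive neurons suffice.

rng = np.random.default_rng(3)

def sample(n):
    # four Gaussian clusters at (+-1, +-1); label = product of the signs (XOR structure)
    centers = rng.choice([-1.0, 1.0], size=(n, 2))
    x = centers + 0.3 * rng.normal(size=(n, 2))
    y = centers[:, 0] * centers[:, 1]
    return x, y

K = 8                                    # a few hidden neurons
W = 0.5 * rng.normal(size=(K, 2))        # first-layer weights
v = 0.5 * rng.normal(size=K)             # second-layer weights
eta = 0.05

for step in range(20000):
    x, y = sample(1)
    h = np.tanh(W @ x[0])                # hidden activations, shape (K,)
    out = v @ h
    err = out - y[0]                     # squared-loss residual
    # backpropagation through the two layers
    grad_v = err * h
    grad_W = np.outer(err * v * (1 - h**2), x[0])
    v -= eta * grad_v
    W -= eta * grad_W

x_test, y_test = sample(2000)
pred = np.sign(np.tanh(x_test @ W.T) @ v)
print("test accuracy:", np.mean(pred == y_test))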
...