# A Geometric Interpretation of Stochastic Gradient Descent Using Diffusion Metrics

@article{Fioresi2020AGI, title={A Geometric Interpretation of Stochastic Gradient Descent Using Diffusion Metrics}, author={Rita Fioresi and Pratik Chaudhari and Stefano Soatto}, journal={Entropy}, year={2020}, volume={22} }

This paper is a step towards developing a geometric understanding of stochastic gradient descent (SGD), a popular algorithm for training deep neural networks. We build upon a recent result which observed that the noise in SGD while training typical networks is highly non-isotropic. That motivated a deterministic model in which the trajectories of our dynamical systems are described via geodesics of a family of metrics arising from a certain diffusion matrix; namely, the covariance of the…
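The diffusion matrix the abstract refers to is the covariance of the stochastic gradients. As a rough illustration of how such a matrix can be estimated empirically (not the paper's own construction; all names here are illustrative), a minimal sketch on a toy least-squares model:

```python
import numpy as np

# Illustrative sketch: estimating a diffusion matrix D, i.e. the covariance
# of per-sample gradients, for a toy 2-parameter least-squares model.
rng = np.random.default_rng(0)

# Toy data: linear regression, y = X w* + noise.
X = rng.normal(size=(200, 2))
y = X @ np.array([1.0, -2.0]) + 0.1 * rng.normal(size=200)

def per_sample_grads(w):
    # Gradient of 0.5 * (x.w - y)^2 for each sample is (x.w - y) * x.
    residuals = X @ w - y                  # shape (n,)
    return residuals[:, None] * X          # shape (n, 2)

def estimate_diffusion_matrix(w):
    g = per_sample_grads(w)
    centered = g - g.mean(axis=0)          # subtract the full-batch gradient
    return centered.T @ centered / len(g)  # empirical covariance

D = estimate_diffusion_matrix(np.zeros(2))
# D is symmetric positive semi-definite; the non-isotropy the paper discusses
# shows up as a large spread in its eigenvalues.
eigvals = np.linalg.eigvalsh(D)
print(eigvals)
```

For real networks the analogous quantity is the covariance of mini-batch gradients over the parameters, which is far from a multiple of the identity in practice.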

## 3 Citations

### Geometry Perspective Of Estimating Learning Capability Of Neural Networks

- Computer Science, ArXiv
- 2020

By correlating the principles of high-energy physics with the learning theory of neural networks, the paper establishes a variant of the Complexity-Action conjecture from an artificial neural network perspective.

### On the Thermodynamic Interpretation of Deep Learning Systems

- Computer Science, GSI
- 2021

It is shown that, in simulations on popular databases (CIFAR10, MNIST), such simplified models appear inadequate, suggesting a more conceptual approach involving contact dynamics and Lie Group Thermodynamics.

### Chaos and Complexity from Quantum Neural Network: A study with Diffusion Metric in Machine Learning

- Computer Science, Journal of High Energy Physics
- 2021

This work establishes the parametrized version of Quantum Complexity and Quantum Chaos in terms of physically relevant quantities, which are essential not only in determining stability but also in providing a significant lower bound on the generalization capability of QNN.

## References

### On the energy landscape of deep networks

- Computer Science
- 2015

It is shown that a regularization term akin to a magnetic field can be modulated with a single scalar parameter to transition the loss function from a complex, non-convex landscape with exponentially many local minima, to a phase with a polynomial number of minima and all the way down to a trivial landscape with a unique minimum.

### Stochastic Gradient Descent Performs Variational Inference, Converges to Limit Cycles for Deep Networks

- Computer Science, 2018 Information Theory and Applications Workshop (ITA)
- 2018

It is proved that SGD minimizes an average potential over the posterior distribution of weights along with an entropic regularization term, and that the most likely trajectories of SGD for deep networks do not behave like Brownian motion around critical points, but resemble closed loops with deterministic components.

### The Effect of Gradient Noise on the Energy Landscape of Deep Networks

- Physics
- 2015

It is demonstrated through experiments on fully-connected and convolutional neural networks that annealing schemes based on trivialization lead to accelerated training and also improve generalization error.

### Entropy-SGD: biasing gradient descent into wide valleys

- Computer Science, ICLR
- 2017

This paper proposes a new optimization algorithm called Entropy-SGD for training deep neural networks that is motivated by the local geometry of the energy landscape and compares favorably to state-of-the-art techniques in terms of generalization error and training time.

### On the Emergence of Invariance and Disentangling in Deep Representations

- Computer Science, ArXiv
- 2017

It is shown that invariance in a deep neural network is equivalent to minimality of the representation it computes, and can be achieved by stacking layers and injecting noise in the computation, under realistic and empirically validated assumptions.

### Emergence of Invariance and Disentanglement in Deep Representations

- Computer Science, 2018 Information Theory and Applications Workshop (ITA)
- 2018

It is shown that in a deep neural network invariance to nuisance factors is equivalent to information minimality of the learned representation, and that stacking layers and injecting noise during training naturally bias the network towards learning invariant representations.

### Densely Connected Convolutional Networks

- Computer Science, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)
- 2017

The Dense Convolutional Network (DenseNet) connects each layer to every other layer in a feed-forward fashion and has several compelling advantages: it alleviates the vanishing-gradient problem, strengthens feature propagation, encourages feature reuse, and substantially reduces the number of parameters.

### Deep Learning

- Computer Science, Nature
- 2015

Deep learning is making major advances in solving problems that have resisted the best attempts of the artificial intelligence community for many years, and will have many more successes in the near future because it requires very little engineering by hand and can easily take advantage of increases in the amount of available computation and data.

### Gradient-based learning applied to document recognition

- Computer Science, Proc. IEEE
- 1998

This paper reviews various methods applied to handwritten character recognition and compares them on a standard handwritten digit recognition task; convolutional neural networks are shown to outperform all other techniques.

### Natural Gradient Works Efficiently in Learning

- Computer Science, Neural Computation
- 1998

The dynamical behavior of natural gradient online learning is analyzed and proved to be Fisher efficient, implying that it asymptotically matches the performance of optimal batch estimation of the parameters.
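The natural gradient preconditions the ordinary gradient with the inverse Fisher information matrix, updating parameters as θ ← θ − η F(θ)⁻¹ ∇L(θ). A minimal sketch (illustrative, not from the cited paper), assuming a toy Gaussian model N(μ, σ²) whose Fisher matrix is known in closed form, F = diag(1/σ², 2/σ²):

```python
import numpy as np

# Illustrative sketch: natural-gradient ascent for fitting a Gaussian
# N(mu, sigma^2) by maximum likelihood. For this model the Fisher matrix
# in (mu, sigma) coordinates is diag(1/sigma^2, 2/sigma^2), so the
# natural gradient is simply the ordinary gradient rescaled by its inverse.
rng = np.random.default_rng(1)
data = rng.normal(loc=3.0, scale=2.0, size=1000)

mu, sigma = 0.0, 1.0
lr = 0.5
for _ in range(200):
    # Gradients of the average negative log-likelihood.
    d_mu = -(data - mu).mean() / sigma**2
    d_sigma = 1.0 / sigma - ((data - mu) ** 2).mean() / sigma**3
    # Inverse Fisher: diag(sigma^2, sigma^2 / 2).
    mu -= lr * sigma**2 * d_mu
    sigma -= lr * (sigma**2 / 2.0) * d_sigma

print(mu, sigma)  # should approach the sample mean and standard deviation
```

Rescaling by F⁻¹ makes the update invariant to reparametrization of the model, which is the geometric idea the diffusion-metric viewpoint of the main paper generalizes.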