A Geometric Interpretation of Stochastic Gradient Descent Using Diffusion Metrics

Rita Fioresi, Pratik Chaudhari, Stefano Soatto
This paper is a step towards a geometric understanding of stochastic gradient descent (SGD), a popular algorithm for training deep neural networks. We build upon a recent result which observed that the noise in SGD while training typical networks is highly non-isotropic. That observation motivates a deterministic model in which the trajectories of our dynamical systems are described via geodesics of a family of metrics arising from a certain diffusion matrix; namely, the covariance of the…
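The claim that SGD noise is non-isotropic can be made concrete with a toy computation. The sketch below is not taken from the paper: it assumes the diffusion matrix is the covariance of per-sample gradients (the abstract is truncated at exactly that point), and it uses a hypothetical two-parameter least-squares problem with badly scaled inputs. All function names are illustrative.

```python
import math
import random


def per_sample_gradients(w, xs, ys):
    """Gradient of 0.5 * (w . x - y)^2 with respect to w, one tuple per sample."""
    grads = []
    for x, y in zip(xs, ys):
        residual = w[0] * x[0] + w[1] * x[1] - y
        grads.append((residual * x[0], residual * x[1]))
    return grads


def covariance_2x2(grads):
    """Entries (c00, c01, c11) of the covariance of a list of 2-D gradients."""
    n = len(grads)
    m0 = sum(g[0] for g in grads) / n
    m1 = sum(g[1] for g in grads) / n
    c00 = sum((g[0] - m0) ** 2 for g in grads) / n
    c01 = sum((g[0] - m0) * (g[1] - m1) for g in grads) / n
    c11 = sum((g[1] - m1) ** 2 for g in grads) / n
    return c00, c01, c11


def eigenvalues_sym_2x2(c00, c01, c11):
    """Closed-form eigenvalues (largest first) of [[c00, c01], [c01, c11]]."""
    mean = 0.5 * (c00 + c11)
    radius = math.hypot(0.5 * (c00 - c11), c01)
    return mean + radius, mean - radius


def make_data(n, seed=0):
    """Synthetic regression data; the two input features have very different scales."""
    rng = random.Random(seed)
    xs = [(rng.gauss(0.0, 3.0), rng.gauss(0.0, 0.3)) for _ in range(n)]
    ys = [1.0 * x[0] - 2.0 * x[1] + rng.gauss(0.0, 0.5) for x in xs]
    return xs, ys


xs, ys = make_data(2000)
grads = per_sample_gradients((0.0, 0.0), xs, ys)
lam_max, lam_min = eigenvalues_sym_2x2(*covariance_2x2(grads))
anisotropy = lam_max / lam_min  # equals 1 only for perfectly isotropic noise
print(f"eigenvalues: {lam_max:.3f}, {lam_min:.3f}; ratio: {anisotropy:.1f}")
```

An eigenvalue ratio far from 1 means the gradient noise is much stronger along one direction than another, which is the sense of "non-isotropic" used in the abstract. For real networks the matrix is far too large to form explicitly, so this only illustrates the definition.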


Geometry Perspective Of Estimating Learning Capability Of Neural Networks

By correlating the principles of high-energy physics with the learning theory of neural networks, the paper establishes a variant of the Complexity-Action conjecture from an artificial neural network perspective.

On the Thermodynamic Interpretation of Deep Learning Systems

It is shown that, in simulations on popular databases (CIFAR10, MNIST), such simplified models appear inadequate, and a more conceptual approach involving contact dynamics and Lie Group Thermodynamics is suggested.

Chaos and Complexity from Quantum Neural Network: A study with Diffusion Metric in Machine Learning

This work establishes the parametrized version of Quantum Complexity and Quantum Chaos in terms of physically relevant quantities, which are not only essential in determining the stability, but also essential in providing a very significant lower bound to the generalization capability of QNN.



On the energy landscape of deep networks

It is shown that a regularization term akin to a magnetic field can be modulated with a single scalar parameter to transition the loss function from a complex, non-convex landscape with exponentially many local minima, to a phase with a polynomial number of minima and all the way down to a trivial landscape with a unique minimum.

Stochastic Gradient Descent Performs Variational Inference, Converges to Limit Cycles for Deep Networks

It is proved that SGD minimizes an average potential over the posterior distribution of weights along with an entropic regularization term, and that the most likely trajectories of SGD for deep networks do not behave like Brownian motion around critical points, but resemble closed loops with deterministic components.

The Effect of Gradient Noise on the Energy Landscape of Deep Networks

It is demonstrated through experiments on fully-connected and convolutional neural networks that annealing schemes based on trivialization lead to accelerated training and also improve generalization error.

Entropy-SGD: biasing gradient descent into wide valleys

This paper proposes a new optimization algorithm called Entropy-SGD for training deep neural networks that is motivated by the local geometry of the energy landscape and compares favorably to state-of-the-art techniques in terms of generalization error and training time.

On the Emergence of Invariance and Disentangling in Deep Representations

It is shown that invariance in a deep neural network is equivalent to minimality of the representation it computes, and can be achieved by stacking layers and injecting noise in the computation, under realistic and empirically validated assumptions.

Emergence of Invariance and Disentanglement in Deep Representations

It is shown that in a deep neural network invariance to nuisance factors is equivalent to information minimality of the learned representation, and that stacking layers and injecting noise during training naturally bias the network towards learning invariant representations.

Densely Connected Convolutional Networks

The Dense Convolutional Network (DenseNet), which connects each layer to every other layer in a feed-forward fashion, has several compelling advantages: it alleviates the vanishing-gradient problem, strengthens feature propagation, encourages feature reuse, and substantially reduces the number of parameters.

Deep Learning

Deep learning is making major advances in solving problems that have resisted the best attempts of the artificial intelligence community for many years, and will have many more successes in the near future because it requires very little engineering by hand and can easily take advantage of increases in the amount of available computation and data.

Gradient-based learning applied to document recognition

This paper reviews various methods applied to handwritten character recognition and compares them on a standard handwritten digit recognition task; convolutional neural networks are shown to outperform all other techniques.

Natural Gradient Works Efficiently in Learning

S. Amari. Neural Computation, 1998.
The dynamical behavior of natural gradient online learning is analyzed and is proved to be Fisher efficient, implying that it has asymptotically the same performance as the optimal batch estimation of parameters.
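The Fisher-efficiency statement can be illustrated in the simplest possible setting. The sketch below is an illustrative toy case, not the paper's general analysis: it assumes a one-dimensional Gaussian model with unknown mean and known variance, for which the Fisher information is 1/sigma^2. Preconditioning the per-sample gradient of the negative log-likelihood by the inverse Fisher information and using a 1/t learning rate makes the online update reproduce the sample mean, which is the batch maximum-likelihood estimate. Function and variable names are hypothetical.

```python
import random


def online_natural_gradient_mean(samples, sigma, mu0=0.0):
    """Online natural gradient estimate of a Gaussian mean with known sigma.

    Per-sample loss: (x - mu)^2 / (2 * sigma^2).
    Ordinary gradient with respect to mu: -(x - mu) / sigma^2.
    Fisher information for mu: 1 / sigma^2.
    Natural gradient = gradient / Fisher = -(x - mu).
    """
    mu = mu0
    fisher = 1.0 / sigma ** 2
    for t, x in enumerate(samples, start=1):
        grad = -(x - mu) / sigma ** 2
        natural_grad = grad / fisher
        mu -= (1.0 / t) * natural_grad  # learning rate 1/t
    return mu


rng = random.Random(1)
sigma = 2.0
samples = [rng.gauss(5.0, sigma) for _ in range(1000)]

mu_online = online_natural_gradient_mean(samples, sigma, mu0=-3.0)
mu_batch = sum(samples) / len(samples)  # batch maximum-likelihood estimate
print(f"online natural gradient: {mu_online:.6f}, batch mean: {mu_batch:.6f}")
```

The two estimates agree up to floating-point rounding regardless of the starting point `mu0`, because the first step with learning rate 1 overwrites it. With a plain gradient step the same schedule would depend on sigma and would not in general match the batch estimate.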