On the momentum term in gradient descent learning algorithms

@article{Qian1999OnTM,
  title={On the momentum term in gradient descent learning algorithms},
  author={Ning Qian},
  journal={Neural networks : the official journal of the International Neural Network Society},
  year={1999},
  volume={12},
  number={1},
  pages={145-151}
}
  • N. Qian
  • Published 1999
  • Physics, Computer Science
  • Neural networks : the official journal of the International Neural Network Society
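
For orientation (a standard textbook form rather than a quotation from the paper): the momentum term in question modifies plain gradient descent on a cost E(w) by adding a fraction of the previous weight change to the current one, with learning rate \varepsilon and momentum parameter p,

    \Delta w_t = -\varepsilon\,\nabla E(w_t) + p\,\Delta w_{t-1}, \qquad w_{t+1} = w_t + \Delta w_t, \qquad 0 \le p < 1.

For p = 0 this reduces to ordinary gradient descent; values of p close to 1 smooth the trajectory and speed up progress along shallow directions of E.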

Citations

On the influence of momentum acceleration on online learning

  • K. Yuan, Bicheng Ying, A. Sayed
  • Computer Science
    2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
  • 2016
The results establish that momentum methods are equivalent to the standard stochastic gradient method with a re-scaled (larger) step-size value, and suggest a method to enhance performance in the stochastic setting by tuning the momentum parameter over time.
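
A back-of-the-envelope way to see the rescaling (a standard heuristic, not quoted from the paper; \mu denotes the momentum parameter and \varepsilon the step size): unrolling the momentum recursion and assuming the gradient changes slowly over its memory gives

    \Delta w_t = -\varepsilon \sum_{k \ge 0} \mu^{k}\,\nabla E(w_{t-k}) \;\approx\; -\frac{\varepsilon}{1-\mu}\,\nabla E(w_t),

i.e. an effective step size of \varepsilon/(1-\mu), which is the "re-scaled (larger) step-size value" referred to above.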

Continuous Time Analysis of Momentum Methods

This work focuses on understanding the role of momentum in the training of neural networks, concentrating on the common situation in which the momentum contribution is fixed at each step of the algorithm, and proves three continuous-time approximations of the discrete algorithms.
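
The best-known such approximation (also derived in Qian's original paper, so the form below is consistent with the surrounding material even though the constants are left schematic) is a damped second-order ODE in which the momentum parameter plays the role of a mass:

    m\,\ddot{w}(t) + \dot{w}(t) = -\nabla E\big(w(t)\big),

with the plain gradient flow \dot{w} = -\nabla E(w) recovered in the massless limit m \to 0.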

A Global Minimization Algorithm Based on a Geodesic of a Lagrangian Formulation of Newtonian Dynamics

A novel adaptive steepest descent method is obtained; applying its first-order update rule to Rosenbrock- and Griewank-type potentials determines the global minimum in most cases from various initial points.
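
The geodesic/Lagrangian update of this paper is not reproduced in the summary above, so the sketch below is only a stand-in: it sets up the Rosenbrock test potential named in the abstract and runs classical momentum gradient descent from several random initial points, to illustrate the benchmarking protocol rather than the paper's own method (a Griewank-type potential could be swapped in the same way).

    import numpy as np

    def rosenbrock_grad(x, a=1.0, b=100.0):
        # Gradient of f(x, y) = (a - x)^2 + b * (y - x^2)^2; global minimum at (a, a^2).
        gx = -2.0 * (a - x[0]) - 4.0 * b * x[0] * (x[1] - x[0] ** 2)
        gy = 2.0 * b * (x[1] - x[0] ** 2)
        return np.array([gx, gy])

    def momentum_descent(grad, x0, lr=1e-4, mu=0.9, steps=50_000):
        # Placeholder optimizer: classical momentum, NOT the paper's geodesic update.
        x = np.array(x0, dtype=float)
        v = np.zeros_like(x)
        for _ in range(steps):
            v = mu * v - lr * grad(x)
            x = x + v
        return x

    rng = np.random.default_rng(0)
    for _ in range(3):
        x0 = rng.uniform(-2.0, 2.0, size=2)   # several random initial points
        print(x0, "->", momentum_descent(rosenbrock_grad, x0))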

Momentum Accelerates Evolutionary Dynamics

This work combines momentum from machine learning with evolutionary dynamics, using information divergences as Lyapunov functions to show that momentum accelerates the convergence of evolutionary dynamics including the replicator equation and Euclidean gradient descent on populations.
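
For context (textbook form, not taken from the cited paper, which adds the momentum term on top of it): the replicator equation evolves a population distribution x on the simplex according to relative fitness,

    \dot{x}_i = x_i \Big( f_i(x) - \sum_j x_j f_j(x) \Big), \qquad i = 1, \dots, n,

so that strategies with above-average fitness f_i grow at the expense of the others; the cited work accelerates this flow (and Euclidean gradient descent on populations) with a momentum term and proves convergence via information-divergence Lyapunov functions.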

Analysis Of Momentum Methods

This work shows that, contrary to popular belief, standard implementations of fixed momentum methods do no more than rescale the learning rate, and, using the method of modified equations from numerical analysis, shows that the momentum method converges to a gradient flow with a momentum-dependent time rescaling.
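
Read together with the step-size result quoted earlier, the claim can be sketched as follows (\mu and \varepsilon are again the momentum parameter and learning rate; this is a paraphrase, not the paper's derivation): in the small-step limit the iterates follow the plain gradient flow on a momentum-rescaled clock,

    \frac{dw}{ds} = -\nabla E(w), \qquad s = \frac{\varepsilon}{1-\mu}\, t,

so fixed momentum changes how fast the flow is traversed, not which trajectory is followed.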

Convergence of batch gradient learning with smoothing regularization and adaptive momentum for neural networks

Compared with existing algorithms, the novel algorithm yields a sparser network structure: it forces weights to become smaller during training so that they can eventually be removed afterwards, which simplifies the network structure and lowers operation time.
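
The paper's specific smoothing regularizer and momentum-adaptation rule are not reproduced in the summary above; the following is only a generic sketch of the overall scheme, with a smoothed L1 penalty standing in for the smoothing regularization and a fixed momentum weight standing in for the adaptive one.

    import numpy as np

    def smoothed_l1_grad(w, eps=1e-3):
        # Gradient of the smooth sparsity penalty sum_i sqrt(w_i^2 + eps),
        # which drives small weights toward zero without the kink of |w_i|.
        return w / np.sqrt(w ** 2 + eps)

    def train_batch(grad_loss, w, lr=0.05, lam=1e-2, mu=0.5, epochs=500):
        # Batch gradient descent with a smoothing regularizer and a momentum term.
        # lam sets the sparsity pressure; mu is a fixed stand-in for the paper's
        # adaptive momentum coefficient.
        v = np.zeros_like(w)
        for _ in range(epochs):
            g = grad_loss(w) + lam * smoothed_l1_grad(w)
            v = mu * v - lr * g
            w = w + v
        return w

    # Toy batch loss ||Aw - b||^2 / n as a stand-in for the network's training error.
    rng = np.random.default_rng(1)
    A, b = rng.normal(size=(20, 5)), rng.normal(size=20)
    grad_loss = lambda w: 2.0 * A.T @ (A @ w - b) / len(b)
    print(train_batch(grad_loss, np.zeros(5)))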

Convergence of Momentum-Based Stochastic Gradient Descent

  • Ruinan Jin, Xingkang He
  • Computer Science
    2020 IEEE 16th International Conference on Control & Automation (ICCA)
  • 2020
It is proved that the mSGD algorithm is almost surely convergent along each trajectory, and the convergence rate of mSGD is analyzed.
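
Here mSGD denotes SGD with a classical momentum term; the almost-sure convergence proof lives in the paper, but a minimal version of the iteration it analyses looks like the following sketch (the toy objective at the end is purely illustrative).

    import numpy as np

    def msgd(grad_sample, w0, lr=0.01, mu=0.9, iters=20_000, seed=0):
        # Momentum-based stochastic gradient descent: each step uses one noisy
        # gradient estimate and a velocity accumulator.
        rng = np.random.default_rng(seed)
        w = np.array(w0, dtype=float)
        v = np.zeros_like(w)
        for _ in range(iters):
            g = grad_sample(w, rng)        # stochastic gradient at the current iterate
            v = mu * v - lr * g
            w = w + v
        return w

    # Toy problem: minimize E[(w - y)^2] / 2 with y ~ N(3, 1); the minimizer is w = 3.
    grad_sample = lambda w, rng: w - (3.0 + rng.normal())
    print(msgd(grad_sample, w0=[0.0]))     # settles into a noisy neighbourhood of 3.0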

Just a Momentum: Analytical Study of Momentum-Based Acceleration Methods in Paradigmatic High-Dimensional Non-Convex Problems

This work uses dynamical mean field theory techniques to describe analytically the average behaviour of several algorithms including heavy-ball momentum and Nesterov acceleration in a prototypical non-convex model: the (spiked) matrix-tensor model.
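
The two accelerated schemes mentioned are the usual discrete updates (generic textbook forms with step size \varepsilon and momentum parameter \mu; the paper itself analyses their high-dimensional average dynamics, which are not reproduced here):

    \text{heavy ball:}\;\; w_{t+1} = w_t + \mu\,(w_t - w_{t-1}) - \varepsilon\,\nabla E(w_t),
    \qquad
    \text{Nesterov:}\;\; w_{t+1} = w_t + \mu\,(w_t - w_{t-1}) - \varepsilon\,\nabla E\big(w_t + \mu\,(w_t - w_{t-1})\big).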

...

References

Increased rates of convergence through learning rate adaptation

Learning internal representations

It is proved that the number of examples required to ensure good generalisation from a representation learner obeys certain bounds, that gradient descent can be used to train neural network representations, and experimental results are reported providing strong qualitative support for the theoretical results.

Learning to Solve Random-Dot Stereograms of Dense and Transparent Surfaces with Recurrent Backpropagation

The recurrent backpropagation learning algorithm of Pineda (1987) is used to construct network models with lateral and feedback connections that can solve the correspondence problem for random-dot stereograms.

Optimal Brain Damage

A class of practical and nearly optimal schemes is derived for adapting the size of a neural network, using second-derivative information to make a tradeoff between network complexity and training-set error.
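
The second-derivative criterion referred to here is the Optimal Brain Damage saliency: with a diagonal approximation of the Hessian of the training error E, the estimated increase in error from removing weight w_i is

    s_i \approx \tfrac{1}{2}\,\frac{\partial^2 E}{\partial w_i^2}\, w_i^2,

and weights with the smallest saliency are pruned first, after which the network is retrained.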

The Computational Brain

Parallel distributed processing (Vol

  • 1986