# Quasi-hyperbolic momentum and Adam for deep learning

```bibtex
@article{Ma2019QuasihyperbolicMA,
  title   = {Quasi-hyperbolic momentum and Adam for deep learning},
  author  = {Jerry Ma and Denis Yarats},
  journal = {ArXiv},
  year    = {2019},
  volume  = {abs/1810.06801}
}
```

Momentum-based acceleration of stochastic gradient descent (SGD) is widely used in deep learning. We propose the quasi-hyperbolic momentum algorithm (QHM) as an extremely simple alteration of momentum SGD, averaging a plain SGD step with a momentum step. We describe numerous connections to and identities with other algorithms, and we characterize the set of two-state optimization algorithms that QHM can recover. Finally, we propose a QH variant of Adam called QHAdam, and we empirically…
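The QHM update described in the abstract is a ν-weighted average of a plain SGD step and a momentum step. Below is a minimal NumPy sketch of that rule; the hyperparameter names (α, β, ν) follow the paper, but the quadratic toy objective is an illustrative example of mine, not the authors' code:

```python
import numpy as np

def qhm_step(theta, g_buf, grad, alpha=0.1, beta=0.9, nu=0.7):
    """One QHM update: a nu-weighted average of a plain SGD step
    and a momentum step. nu=0 recovers plain SGD; nu=1 recovers
    momentum SGD with a (1 - beta)-dampened buffer."""
    g_buf = beta * g_buf + (1 - beta) * grad   # momentum buffer: EMA of gradients
    update = (1 - nu) * grad + nu * g_buf      # quasi-hyperbolic average
    return theta - alpha * update, g_buf

# Toy example: minimize f(theta) = 0.5 * ||theta||^2,
# whose gradient at theta is simply theta.
theta, g_buf = np.array([5.0, -3.0]), np.zeros(2)
for _ in range(200):
    theta, g_buf = qhm_step(theta, g_buf, grad=theta)
# theta is now close to the minimizer at the origin
```

Setting `nu=0` reduces the update to `theta - alpha * grad`, which is the sense in which QHM interpolates between SGD and momentum SGD.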


## 68 Citations

Training Deep Neural Networks with Adaptive Momentum Inspired by the Quadratic Optimization

- Computer Science, Mathematics · ArXiv
- 2021

This paper proposes a new adaptive momentum, inspired by the optimal choice of heavy-ball momentum for quadratic optimization, that can improve stochastic gradient descent (SGD) and Adam.

Decaying momentum helps neural network training

- Computer Science · ArXiv
- 2019

A decaying momentum (Demon) rule is proposed, motivated by decaying the total contribution of a gradient to all future updates, which leads to significantly improved training, notably competitive to momentum SGD with learning rate decay, even in settings in which adaptive methods are typically non-competitive.
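The Demon rule gives the momentum parameter a closed-form decay. The sketch below is reconstructed from my recollection of the Demon paper, not quoted from this snippet, so the exact constants should be checked against the paper; the idea is that β decays from its initial value to 0 so that each gradient's total future contribution shrinks linearly over training:

```python
def demon_beta(t, T, beta_init=0.9):
    """Demon momentum schedule (assumed form): beta decays from
    beta_init at step t=0 down to 0 at the final step t=T."""
    frac = 1.0 - t / T
    return beta_init * frac / ((1.0 - beta_init) + beta_init * frac)
```

For example, with `beta_init=0.9` and `T=1000`, β starts at 0.9, passes through roughly 0.82 at the midpoint, and reaches 0 at the end of training.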

Demon: Momentum Decay for Improved Neural Network Training

Momentum is a popular technique in deep learning for gradient-based optimizers. We propose a decaying momentum (DEMON) rule, motivated by decaying the total contribution of a gradient to all future…

Demon: Momentum Decay for Improved Neural Network Training

- Computer Science
- 2021

A decaying momentum (Demon) rule is proposed, motivated by decaying the total contribution of a gradient to all future updates, which leads to significantly improved training, notably competitive to momentum SGD with learning rate decay, even in settings in which adaptive methods are typically non-competitive.

Stochastic Gradient Descent with Large Learning Rate

- Mathematics, Computer Science · ArXiv
- 2020

The main contributions of this work are to derive the stable distribution for discrete-time SGD in a quadratic loss function with and without momentum.

Decaying momentum helps neural network training

- 2019

Momentum is a simple and popular technique in deep learning for gradient-based optimizers. We propose a decaying momentum (DEMON) rule, motivated by decaying the total contribution of a gradient to…

RAdam: perform 4 iterations of momentum SGD, then use Adam with fixed warmup

- 2019

Adaptive optimization algorithms such as Adam (Kingma & Ba, 2014) are widely used in deep learning. The stability of such algorithms is often improved with a warmup schedule for the learning rate.…

Understanding the Role of Momentum in Stochastic Gradient Methods

- Computer Science, Mathematics · NeurIPS
- 2019

The general formulation of QHM is used to give a unified analysis of several popular algorithms, covering their asymptotic convergence conditions, stability regions, and properties of their stationary distributions, and sometimes counter-intuitive practical guidelines for setting the learning rate and momentum parameters.

Gradient descent with momentum - to accelerate or to super-accelerate?

- Computer Science, Mathematics · ArXiv
- 2020

It is shown explicitly that super-accelerating the momentum algorithm is beneficial, not only for this idealized problem, but also for several synthetic loss landscapes and for the MNIST classification task with neural networks.

Provable Convergence of Nesterov Accelerated Method for Over-Parameterized Neural Networks

- Computer Science · ArXiv
- 2021

It is proved that the error of NAG converges to zero at a linear convergence rate 1 − Θ(1/√κ), where κ > 1 is determined by the initialization and the architecture of the neural network.

## References

Showing 1-10 of 70 references

On the importance of initialization and momentum in deep learning

- Computer Science · ICML
- 2013

It is shown that when stochastic gradient descent with momentum uses a well-designed random initialization and a particular type of slowly increasing schedule for the momentum parameter, it can train both DNNs and RNNs to levels of performance that were previously achievable only with Hessian-Free optimization.
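The "slowly increasing schedule" for the momentum parameter mentioned in this snippet can be sketched as follows. Both the formula and the 250-step granularity are reconstructed from my memory of Sutskever et al. (2013) and should be verified against the paper; the schedule ramps μ from 0.5 toward 1, capped at a chosen maximum:

```python
import math

def sutskever_momentum(t, mu_max=0.99):
    """Slowly increasing momentum schedule (assumed form):
    mu_t = min(1 - 2**(-1 - log2(floor(t/250) + 1)), mu_max),
    which equals 1 - 1/(2n) with n = floor(t/250) + 1."""
    return min(1 - 2 ** (-1 - math.log2(t // 250 + 1)), mu_max)
```

Under this form the schedule starts at μ = 0.5, reaches 0.75 after 250 steps, and approaches `mu_max` as training proceeds.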

On the Insufficiency of Existing Momentum Schemes for Stochastic Optimization

- Computer Science, Mathematics · 2018 Information Theory and Applications Workshop (ITA)
- 2018

The results suggest (along with empirical evidence) that HB or NAG's practical performance gains are a by-product of minibatching, and provide a viable (and provable) alternative, which significantly improves over HB, NAG, and SGD's performance.

Adam: A Method for Stochastic Optimization

- Computer Science, Mathematics · ICLR
- 2015

This work introduces Adam, an algorithm for first-order gradient-based optimization of stochastic objective functions, based on adaptive estimates of lower-order moments, and provides a regret bound on the convergence rate that is comparable to the best known results under the online convex optimization framework.
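The "adaptive estimates of lower-order moments" here are exponential moving averages of the gradient and its elementwise square, with bias correction for their zero initialization. A minimal NumPy sketch of one Adam step as described by Kingma & Ba (the toy quadratic loop is my own illustration):

```python
import numpy as np

def adam_step(theta, m, v, grad, t, alpha=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update at (1-indexed) step t."""
    m = beta1 * m + (1 - beta1) * grad       # EMA of the gradient (first moment)
    v = beta2 * v + (1 - beta2) * grad**2    # EMA of the squared gradient (second moment)
    m_hat = m / (1 - beta1**t)               # bias-corrected first moment
    v_hat = v / (1 - beta2**t)               # bias-corrected second moment
    theta = theta - alpha * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

# Toy example: minimize f(theta) = 0.5 * theta^2, whose gradient is theta.
theta, m, v = np.array([1.0]), np.zeros(1), np.zeros(1)
for t in range(1, 501):
    theta, m, v = adam_step(theta, m, v, grad=theta, t=t, alpha=0.01)
```

Because the update divides by the root of the second-moment estimate, the very first step has magnitude close to `alpha` regardless of the gradient's scale.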

Fixing Weight Decay Regularization in Adam

- Computer Science, Mathematics · ArXiv
- 2017

This work decouples the optimal choice of weight decay factor from the setting of the learning rate for both standard SGD and Adam and substantially improves Adam's generalization performance, allowing it to compete with SGD with momentum on image classification datasets.

Fast Stochastic Variance Reduced Gradient Method with Momentum Acceleration for Machine Learning

- Computer Science, Mathematics · ArXiv
- 2017

The results show that FSVRG outperforms the state-of-the-art stochastic methods, including Katyusha, and is empirically studied for solving various machine learning problems such as logistic regression, ridge regression, Lasso and SVM.

YellowFin and the Art of Momentum Tuning

- Mathematics, Computer Science · MLSys
- 2019

This work revisits the momentum SGD algorithm and shows that hand-tuning a single learning rate and momentum makes it competitive with Adam, and designs YellowFin, an automatic tuner for momentum and learning rate in SGD.

Online Learning Rate Adaptation with Hypergradient Descent

- Computer Science, Mathematics · ICLR
- 2018

We introduce a general method for improving the convergence rate of gradient-based optimizers that is easy to implement and works well in practice. We demonstrate the effectiveness of the method in a…

Asynchrony begets momentum, with an application to deep learning

- Computer Science, Mathematics · 2016 54th Annual Allerton Conference on Communication, Control, and Computing (Allerton)
- 2016

It is shown that running stochastic gradient descent in an asynchronous manner can be viewed as adding a momentum-like term to the SGD iteration, and an important implication is that tuning the momentum parameter is important when considering different levels of asynchrony.

A PID Controller Approach for Stochastic Optimization of Deep Networks

- Computer Science · 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition
- 2018

The proposed PID method substantially reduces the overshoot of SGD-Momentum, and it achieves up to 50% acceleration on popular deep network architectures with competitive accuracy, as verified by experiments on benchmark datasets including CIFAR10, CIFAR100, and Tiny-ImageNet.

Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift

- Computer Science · ICML
- 2015

Applied to a state-of-the-art image classification model, Batch Normalization achieves the same accuracy with 14 times fewer training steps, and beats the original model by a significant margin.