Corpus ID: 53112107

Quasi-hyperbolic momentum and Adam for deep learning

@article{Ma2019QuasihyperbolicMA,
  title={Quasi-hyperbolic momentum and Adam for deep learning},
  author={Jerry Ma and Denis Yarats},
  journal={ArXiv},
  year={2019},
  volume={abs/1810.06801}
}
Momentum-based acceleration of stochastic gradient descent (SGD) is widely used in deep learning. We propose the quasi-hyperbolic momentum algorithm (QHM) as an extremely simple alteration of momentum SGD, averaging a plain SGD step with a momentum step. We describe numerous connections to and identities with other algorithms, and we characterize the set of two-state optimization algorithms that QHM can recover. Finally, we propose a QH variant of Adam called QHAdam, and we empirically…
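For concreteness, below is a minimal NumPy sketch of the QHM update described in the abstract: the momentum buffer is an exponential moving average of gradients, and the parameter step is a ν-weighted average of a plain SGD step and a momentum step. The function name qhm_step, the toy quadratic objective, and the default values (ν = 0.7, β = 0.999) are illustrative assumptions, not the authors' reference implementation; QHAdam applies the analogous averaging inside Adam and is not shown.

```python
# Minimal sketch of quasi-hyperbolic momentum (QHM), assuming the update
#   g_{t+1}     = (1 - beta) * grad_t + beta * g_t                  (gradient EMA)
#   theta_{t+1} = theta_t - alpha * ((1 - nu) * grad_t + nu * g_{t+1})
# i.e. a nu-weighted average of a plain SGD step and a momentum step.
# Names and defaults here are illustrative, not the paper's reference code.
import numpy as np

def qhm_step(theta, g_buf, grad, alpha=0.1, beta=0.999, nu=0.7):
    """One QHM update; returns (new parameters, new momentum buffer)."""
    g_buf = (1 - beta) * grad + beta * g_buf   # update gradient EMA
    step = (1 - nu) * grad + nu * g_buf        # average SGD and momentum steps
    return theta - alpha * step, g_buf

# Toy usage: minimize f(theta) = 0.5 * ||theta||^2, whose gradient is theta.
theta = np.array([5.0, -3.0])
g_buf = np.zeros_like(theta)
for _ in range(5000):
    theta, g_buf = qhm_step(theta, g_buf, grad=theta)
print(theta)  # converges toward the minimum at the origin
```

With ν = 1 the step reduces to (normalized) momentum SGD and with ν = 0 to plain SGD, which is the sense in which QHM interpolates between the two.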

Citations

Training Deep Neural Networks with Adaptive Momentum Inspired by the Quadratic Optimization
This paper proposes a new adaptive momentum, inspired by the optimal choice of the heavy-ball momentum for quadratic optimization, that can improve stochastic gradient descent (SGD) and Adam.
Demon: Momentum Decay for Improved Neural Network Training
A decaying momentum (Demon) rule is proposed, motivated by decaying the total contribution of a gradient to all future updates, which leads to significantly improved training, notably competitive with momentum SGD with learning rate decay, even in settings in which adaptive methods are typically non-competitive.
Stochastic Gradient Descent with Large Learning Rate
The main contributions of this work are to derive the stable distribution of discrete-time SGD on a quadratic loss function, both with and without momentum.
RAdam: perform 4 iterations of momentum SGD, then use Adam with fixed warmup
Adaptive optimization algorithms such as Adam (Kingma & Ba, 2014) are widely used in deep learning. The stability of such algorithms is often improved with a warmup schedule for the learning rate…
Understanding the Role of Momentum in Stochastic Gradient Methods
The general formulation of QHM is used to give a unified analysis of several popular algorithms, covering their asymptotic convergence conditions, stability regions, and properties of their stationary distributions, yielding sometimes counter-intuitive practical guidelines for setting the learning rate and momentum parameters.
Gradient descent with momentum - to accelerate or to super-accelerate?
It is shown explicitly that super-accelerating the momentum algorithm is beneficial, not only for this idealized problem, but also for several synthetic loss landscapes and for the MNIST classification task with neural networks.
Provable Convergence of Nesterov Accelerated Method for Over-Parameterized Neural Networks
It is proved that the error of NAG converges to zero at a linear convergence rate 1 − Θ(1/√κ), where κ > 1 is determined by the initialization and the architecture of the neural network.

    References

Showing 1-10 of 70 references
On the importance of initialization and momentum in deep learning
It is shown that when stochastic gradient descent with momentum uses a well-designed random initialization and a particular type of slowly increasing schedule for the momentum parameter, it can train both DNNs and RNNs to levels of performance that were previously achievable only with Hessian-Free optimization.
On the Insufficiency of Existing Momentum Schemes for Stochastic Optimization
The results suggest, along with empirical evidence, that the practical performance gains of HB and NAG are a by-product of minibatching, and a viable (and provable) alternative is provided that significantly improves over the performance of HB, NAG, and SGD.
Adam: A Method for Stochastic Optimization
This work introduces Adam, an algorithm for first-order gradient-based optimization of stochastic objective functions based on adaptive estimates of lower-order moments, and provides a regret bound on the convergence rate that is comparable to the best known results under the online convex optimization framework.
Fixing Weight Decay Regularization in Adam
This work decouples the optimal choice of weight decay factor from the setting of the learning rate for both standard SGD and Adam, and substantially improves Adam's generalization performance, allowing it to compete with SGD with momentum on image classification datasets.
Fast Stochastic Variance Reduced Gradient Method with Momentum Acceleration for Machine Learning
FSVRG is studied empirically on various machine learning problems such as logistic regression, ridge regression, Lasso, and SVM, and the results show that it outperforms state-of-the-art stochastic methods, including Katyusha.
YellowFin and the Art of Momentum Tuning
This work revisits the momentum SGD algorithm and shows that hand-tuning a single learning rate and momentum makes it competitive with Adam, and designs YellowFin, an automatic tuner for momentum and learning rate in SGD.
Online Learning Rate Adaptation with Hypergradient Descent
We introduce a general method for improving the convergence rate of gradient-based optimizers that is easy to implement and works well in practice. We demonstrate the effectiveness of the method in a…
Asynchrony begets momentum, with an application to deep learning
It is shown that running stochastic gradient descent in an asynchronous manner can be viewed as adding a momentum-like term to the SGD iteration, and an important implication is that tuning the momentum parameter is important when considering different levels of asynchrony.
A PID Controller Approach for Stochastic Optimization of Deep Networks
The proposed PID method greatly reduces the overshoot phenomenon of SGD-Momentum and achieves up to 50% acceleration on popular deep network architectures with competitive accuracy, as verified by experiments on benchmark datasets including CIFAR-10, CIFAR-100, and Tiny-ImageNet.
Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift
Applied to a state-of-the-art image classification model, Batch Normalization achieves the same accuracy with 14 times fewer training steps, and beats the original model by a significant margin.