• Corpus ID: 239016803

Training Deep Neural Networks with Adaptive Momentum Inspired by the Quadratic Optimization

  • Tao Sun, Huaming Ling, Zuoqiang Shi, Dongsheng Li, Bao Wang
  • Published 18 October 2021
  • Computer Science, Mathematics
  • ArXiv
Heavy ball momentum is crucial in accelerating (stochastic) gradient-based optimization algorithms for machine learning. Existing heavy ball momentum is usually weighted by a uniform hyperparameter, which requires excessive tuning. Moreover, the calibrated fixed hyperparameter may not lead to optimal performance. In this paper, to eliminate the effort of tuning the momentum-related hyperparameter, we propose a new adaptive momentum inspired by the optimal choice of the heavy ball momentum for… 
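For intuition on the quadratic case the abstract refers to: Polyak's classical analysis gives closed-form optimal heavy-ball hyperparameters for a quadratic whose Hessian eigenvalues lie in [mu, L]. A minimal sketch, with an illustrative toy matrix A (not from the paper):

```python
import numpy as np

# Heavy ball momentum on a quadratic f(x) = 0.5 * x^T A x, using Polyak's
# classical optimal hyperparameters for Hessian eigenvalues in [mu, L]:
#   lr   = 4 / (sqrt(L) + sqrt(mu))^2
#   beta = ((sqrt(L) - sqrt(mu)) / (sqrt(L) + sqrt(mu)))^2
A = np.diag([1.0, 10.0])          # toy quadratic; mu = 1, L = 10
mu, L = 1.0, 10.0
lr = 4.0 / (np.sqrt(L) + np.sqrt(mu)) ** 2
beta = ((np.sqrt(L) - np.sqrt(mu)) / (np.sqrt(L) + np.sqrt(mu))) ** 2

# Heavy ball update: x_{t+1} = x_t - lr * grad(x_t) + beta * (x_t - x_{t-1})
x, x_prev = np.array([1.0, 1.0]), np.array([1.0, 1.0])
for _ in range(200):
    x, x_prev = x - lr * (A @ x) + beta * (x - x_prev), x
print(np.linalg.norm(x))          # iterates converge to the minimizer at 0
```

The paper's adaptive momentum replaces a hand-tuned fixed beta with a per-iteration estimate inspired by this optimal choice; the sketch only shows the classical fixed-beta baseline.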


How Does Momentum Benefit Deep Neural Networks Architecture Design? A Few Case Studies
It is shown that integrating momentum into neural network architectures has several remarkable theoretical and empirical benefits, including the ability to overcome vanishing gradient issues in training RNNs and neural ODEs, enabling effective learning of long-term dependencies.
AIR-Net: Adaptive and Implicit Regularization Neural Network for Matrix Completion
  • Zhemin Li, Hongxia Wang
  • Computer Science, Mathematics
  • 2021
Theoretically, it is shown that the adaptive regularization of AIR-Net enhances the implicit regularization and vanishes at the end of training. The model's effectiveness is validated on various benchmark tasks, indicating that AIR-Net is particularly favorable in scenarios where the missing entries are non-uniform.


Stochastic Gradient Descent with Nonlinear Conjugate Gradient-Style Adaptive Momentum
Stochastic gradient descent (SGD) with this new adaptive momentum eliminates the need for the momentum hyperparameter calibration, allows a significantly larger learning rate, accelerates DNN training, and improves final accuracy and robustness of the trained DNNs.
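As a rough illustration of what "conjugate gradient-style" adaptive momentum can look like, one can weight the momentum by a Fletcher-Reeves-type ratio of successive gradient norms. This is a generic nonlinear-CG heuristic, not necessarily the exact formula of the cited paper:

```python
import numpy as np

def fr_momentum(g_new, g_old, eps=1e-12):
    """Fletcher-Reeves-style ratio ||g_new||^2 / ||g_old||^2, used here as an
    illustrative adaptive momentum weight (denominator floored for safety)."""
    return float(g_new @ g_new) / max(float(g_old @ g_old), eps)

# Illustrative use inside an SGD loop: v = fr_momentum(g, g_prev) * v - lr * g
```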
Scheduled Restart Momentum for Accelerated Stochastic Gradient Descent
SRSGD, a new NAG-style scheme for training DNNs, replaces the constant momentum in SGD with the increasing momentum of NAG but stabilizes the iterations by resetting the momentum to zero according to a schedule.
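The restart idea can be sketched as a momentum schedule: the momentum weight grows NAG-style as t/(t+3) and is periodically reset to zero. The restart period below is a hypothetical choice, not the paper's tuned schedule:

```python
# Scheduled-restart momentum schedule (SRSGD idea, sketched):
# NAG-style increasing momentum beta_t = t/(t+3), reset every `restart` steps.
def srsgd_betas(num_iters, restart=40):
    betas = []
    t = 0
    for _ in range(num_iters):
        betas.append(t / (t + 3))
        t += 1
        if t == restart:
            t = 0            # scheduled restart: momentum drops back to zero
    return betas
```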
Quasi-hyperbolic momentum and Adam for deep learning
The quasi-hyperbolic momentum algorithm (QHM) is proposed as an extremely simple alteration of momentum SGD, averaging a plain SGD step with a momentum step, and a QH variant of Adam is proposed called QHAdam.
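The QHM update is simple enough to state in a few lines: nu interpolates between plain SGD (nu = 0) and momentum SGD (nu = 1). The hyperparameter values below are illustrative, not the paper's recommended defaults:

```python
# One QHM step on a scalar parameter: average a plain SGD step with a
# momentum step, weighted by nu (illustrative hyperparameter values).
def qhm_step(theta, g_buf, grad, lr=0.1, beta=0.9, nu=0.7):
    g_buf = beta * g_buf + (1 - beta) * grad          # momentum buffer (EMA of gradients)
    theta = theta - lr * ((1 - nu) * grad + nu * g_buf)
    return theta, g_buf
```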
On the importance of initialization and momentum in deep learning
It is shown that when stochastic gradient descent with momentum uses a well-designed random initialization and a particular type of slowly increasing schedule for the momentum parameter, it can train both DNNs and RNNs to levels of performance that were previously achievable only with Hessian-Free optimization.
The Marginal Value of Adaptive Gradient Methods in Machine Learning
It is observed that the solutions found by adaptive methods generalize worse (often significantly worse) than SGD, even when these solutions have better training performance, suggesting that practitioners should reconsider the use of adaptive methods to train neural networks.
A Unified Analysis of Stochastic Momentum Methods for Deep Learning
Through the uniform stability approach, the analysis shows that the momentum term can improve the stability of the learned model and hence its generalization performance.
Adaptive Gradient Methods with Dynamic Bound of Learning Rate
New variants of Adam and AMSGrad are provided, called AdaBound and AMSBound respectively, which employ dynamic bounds on learning rates to achieve a gradual and smooth transition from adaptive methods to SGD and give a theoretical proof of convergence.
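The dynamic bounds can be sketched as clipping the per-parameter step size into a band that tightens toward a final SGD-like learning rate; the bound schedules follow the paper's form, while final_lr and gamma below are illustrative values:

```python
# AdaBound-style clipping of an adaptive per-parameter learning rate into
# bounds that converge to final_lr as t grows (final_lr, gamma illustrative).
def adabound_lr(adaptive_lr, t, final_lr=0.1, gamma=1e-3):
    lower = final_lr * (1 - 1 / (gamma * t + 1))   # rises toward final_lr
    upper = final_lr * (1 + 1 / (gamma * t))       # falls toward final_lr
    return min(max(adaptive_lr, lower), upper)
```

Early in training the band is wide, so the method behaves adaptively; late in training both bounds approach final_lr, so the update degenerates to SGD with that learning rate.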
Adam: A Method for Stochastic Optimization
This work introduces Adam, an algorithm for first-order gradient-based optimization of stochastic objective functions, based on adaptive estimates of lower-order moments, and provides a regret bound on the convergence rate that is comparable to the best known results under the online convex optimization framework.
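For reference, a single Adam step on a scalar parameter, following the update rule in the paper (bias-corrected first and second moment estimates):

```python
import math

# One Adam step: exponential moving averages of the gradient (m) and squared
# gradient (v), bias-corrected by 1 - beta^t, then a normalized update.
def adam_step(theta, m, v, grad, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    m = b1 * m + (1 - b1) * grad            # first moment (mean of gradients)
    v = b2 * v + (1 - b2) * grad ** 2       # second moment (uncentered variance)
    m_hat = m / (1 - b1 ** t)               # bias correction for zero init
    v_hat = v / (1 - b2 ** t)
    theta = theta - lr * m_hat / (math.sqrt(v_hat) + eps)
    return theta, m, v
```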
Decoupled Weight Decay Regularization
This work proposes a simple modification to recover the original formulation of weight decay regularization by decoupling the weight decay from the optimization steps taken w.r.t. the loss function, and provides empirical evidence that this modification substantially improves Adam's generalization performance.
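The decoupling can be shown in a few lines: instead of adding wd * theta to the gradient before the adaptive step (L2-in-gradient, which the adaptive rescaling then distorts), the decay is applied directly to the weights. A sketch, with adaptive_update standing in for the precomputed Adam step:

```python
# Decoupled weight decay (AdamW idea, sketched): the decay term lr * wd * theta
# is applied to the weights separately from the adaptive gradient update,
# rather than being folded into the gradient as an L2 penalty.
def decoupled_decay_step(theta, adaptive_update, lr=1e-3, wd=1e-2):
    return theta - adaptive_update - lr * wd * theta
```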
Accelerating SGD with momentum for over-parameterized learning
MaSS is introduced, and it is proved that MaSS obtains an accelerated convergence rate over SGD for any mini-batch size in the linear setting; the practically important question of how the convergence rate and optimal hyperparameters depend on the mini-batch size is also analyzed.