Corpus ID: 238583805

Momentum Centering and Asynchronous Update for Adaptive Gradient Methods

@article{Zhuang2021MomentumCA,
  title={Momentum Centering and Asynchronous Update for Adaptive Gradient Methods},
  author={Juntang Zhuang and Yifan Ding and Tommy Tang and Nicha C. Dvornek and Sekhar C. Tatikonda and James S. Duncan},
  journal={ArXiv},
  year={2021},
  volume={abs/2110.05454}
}
We propose ACProp (Asynchronous-centering-Prop), an adaptive optimizer that combines centering of the second momentum with an asynchronous update (e.g., for the t-th update, the denominator uses information only up to step t − 1, while the numerator uses the gradient at step t). ACProp has both strong theoretical properties and strong empirical performance. Using the example by Reddi et al. (2018), we show that asynchronous optimizers (e.g., AdaShift, ACProp) have a weaker convergence condition than synchronous optimizers (e.g… 
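As a rough illustration of the asynchronous, centered update described in the abstract, below is a minimal NumPy sketch of a single step. The hyperparameter names, the zero initialization of the state, and the omission of bias correction and first-step handling are assumptions of this sketch, not the paper's exact Algorithm.

```python
import numpy as np

def acprop_step(theta, grad, state, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One ACProp-style step (sketch based on the abstract's description).

    Asynchronous update: the denominator uses the centered second moment
    from step t-1, while the numerator uses the gradient at step t,
    which decorrelates the two quantities.
    """
    m, s = state["m"], state["s"]  # exponential moving averages from step t-1

    # Parameter update: current gradient over the *previous* denominator.
    # (Bias correction / warm-up for the first steps is omitted here.)
    theta = theta - lr * grad / (np.sqrt(s) + eps)

    # Only afterwards fold the new gradient into the moment estimates.
    m = beta1 * m + (1 - beta1) * grad                # first moment
    s = beta2 * s + (1 - beta2) * (grad - m) ** 2     # centered second moment

    state["m"], state["s"] = m, s
    return theta, state

# Example usage with a toy 3-dimensional parameter vector:
theta = np.zeros(3)
state = {"m": np.zeros(3), "s": np.zeros(3)}
grad = np.array([0.1, -0.2, 0.3])
theta, state = acprop_step(theta, grad, state)
```

The key ordering is that the step is taken with the second-moment estimate from step t − 1 before the new gradient is folded into the moving averages; that ordering is what the "asynchronous" in the name refers to.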

References

Showing 1-10 of 56 references
A Sufficient Condition for Convergences of Adam and RMSProp
TLDR
An alternative easy-to-check sufficient condition is introduced, which merely depends on the parameters of the base learning rate and combinations of historical second-order moments, to guarantee the global convergence of generic Adam/RMSProp for solving large-scale non-convex stochastic optimization.
AdaShift: Decorrelation and Convergence of Adaptive Learning Rate Methods
TLDR
AdaShift is proposed, a novel adaptive learning rate method that decorrelates $v_t$ and $g_t$ by temporal shifting, i.e., using the temporally shifted gradient $g_{t-n}$ to calculate $v_t$.
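A one-line sketch of the temporal shift described above, in the usual Adam-style notation ($\alpha$, $\beta_2$, $\epsilon$ are assumed names here; the full method additionally applies a function over a window of shifted gradients, which this form omits):

$$v_t = \beta_2 v_{t-1} + (1 - \beta_2)\, g_{t-n}^2, \qquad \theta_{t+1} = \theta_t - \frac{\alpha\, g_t}{\sqrt{v_t} + \epsilon}$$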
Adaptive Gradient Methods with Dynamic Bound of Learning Rate
TLDR
New variants of Adam and AMSGrad are provided, called AdaBound and AMSBound respectively, which employ dynamic bounds on learning rates to achieve a gradual and smooth transition from adaptive methods to SGD and give a theoretical proof of convergence.
Adam: A Method for Stochastic Optimization
TLDR
This work introduces Adam, an algorithm for first-order gradient-based optimization of stochastic objective functions, based on adaptive estimates of lower-order moments, and provides a regret bound on the convergence rate that is comparable to the best known results under the online convex optimization framework.
Adaptive Methods for Nonconvex Optimization
TLDR
The result implies that increasing minibatch sizes enables convergence, thus providing a way to circumvent the non-convergence issues, and provides a new adaptive optimization algorithm, Yogi, which controls the increase in effective learning rate, leading to even better performance with similar theoretical guarantees on convergence.
GANs Trained by a Two Time-Scale Update Rule Converge to a Local Nash Equilibrium
TLDR
This work proposes a two time-scale update rule (TTUR) for training GANs with stochastic gradient descent on arbitrary GAN loss functions and introduces the "Frechet Inception Distance" (FID) which captures the similarity of generated images to real ones better than the Inception Score.
Lookahead Optimizer: k steps forward, 1 step back
TLDR
Lookahead improves the learning stability and lowers the variance of its inner optimizer with negligible computation and memory cost, and can significantly improve the performance of SGD and Adam, even with their default hyperparameter settings.
On the Variance of the Adaptive Learning Rate and Beyond
TLDR
This work identifies a problem of the adaptive learning rate, suggests warmup works as a variance reduction technique, and proposes RAdam, a new variant of Adam, by introducing a term to rectify the variance of the adaptive learning rate.
On the Convergence of A Class of Adam-Type Algorithms for Non-Convex Optimization
TLDR
A set of mild sufficient conditions are provided that guarantee the convergence for the Adam-type methods and it is proved that under these derived conditions, these methods can achieve the convergence rate of order $O(\log{T}/\sqrt{T})$ for nonconvex stochastic optimization.
Adaptive Subgradient Methods for Online Learning and Stochastic Optimization
TLDR
This work describes and analyzes an apparatus for adaptively modifying the proximal function, which significantly simplifies setting a learning rate and results in regret guarantees that are provably as good as those of the best proximal function that can be chosen in hindsight.