• Corpus ID: 246822424

# The Power of Adaptivity in SGD: Self-Tuning Step Sizes with Unbounded Gradients and Affine Variance

@inproceedings{Faw2022ThePO,
title={The Power of Adaptivity in SGD: Self-Tuning Step Sizes with Unbounded Gradients and Affine Variance},
author={Matthew Faw and Isidoros Tziotis and Constantine Caramanis and Aryan Mokhtari and Sanjay Shakkottai and Rachel A. Ward},
booktitle={Annual Conference Computational Learning Theory},
year={2022}
}
• Published in Annual Conference…, 11 February 2022 · Computer Science
We study convergence rates of AdaGrad-Norm as an exemplar of adaptive stochastic gradient (SGD) methods, in which the step sizes change based on observed stochastic gradients, for minimizing non-convex, smooth objectives. Despite their popularity, the analysis of adaptive SGD lags behind that of non-adaptive methods in this setting. Specifically, all prior works rely on some subset of the following assumptions: (i) uniformly-bounded gradient norms, (ii) uniformly-bounded stochastic gradient…
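The abstract describes AdaGrad-Norm, whose scalar step size is self-tuned from the observed stochastic gradients. A minimal sketch of that update follows, on a hypothetical toy quadratic with Gaussian gradient noise; `eta`, `b0`, and the objective are illustrative assumptions, not values from the paper.

```python
import numpy as np

def adagrad_norm(grad_fn, x0, eta=1.0, b0=1e-2, steps=1000, rng=None):
    """AdaGrad-Norm sketch: divide a base step size by the running
    root-sum-of-squares of stochastic gradient norms, so no smoothness
    or noise constants need to be known in advance."""
    rng = np.random.default_rng(0) if rng is None else rng
    x = np.asarray(x0, dtype=float)
    b_sq = b0 ** 2  # accumulator: b_k^2 = b_0^2 + sum_j ||g_j||^2
    for _ in range(steps):
        g = grad_fn(x, rng)
        b_sq += float(np.dot(g, g))
        x = x - (eta / np.sqrt(b_sq)) * g  # x_{k+1} = x_k - (eta / b_k) g_k
    return x

# Toy objective f(x) = 0.5 * ||x||^2 with additive Gaussian gradient noise
# (an assumed example, chosen so the exact minimizer x* = 0 is known).
def noisy_grad(x, rng):
    return x + 0.1 * rng.standard_normal(x.shape)

x_final = adagrad_norm(noisy_grad, x0=np.ones(5), steps=5000)
print(np.linalg.norm(x_final))  # small residual norm near the optimum
```

Note that the step size decays automatically as gradient mass accumulates; no learning-rate schedule is specified by hand.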

## Citations

• Computer Science, Mathematics
ArXiv
• 2022
The proposed AdaSpider method is the first parameter-free non-convex variance-reduction method, in the sense that it does not require knowledge of problem-dependent parameters such as the smoothness constant L, the target accuracy ε, or any bound on gradient norms.
• Computer Science
ArXiv
• 2022
• Computer Science, Mathematics
• 2020
This work provides a simple proof of convergence covering both the Adam and AdaGrad adaptive optimization algorithms when applied to smooth (possibly non-convex) objective functions with bounded gradients, and obtains the tightest dependency on the heavy-ball momentum among all previous convergence bounds.

## References

Showing 1–10 of 28 references

• Computer Science
ICML
• 2019
The norm version of AdaGrad (AdaGrad-Norm) converges to a stationary point at the O(log(N)/√N) rate in the stochastic setting, and at the optimal O(1/N) rate in the batch (non-stochastic) setting; in this sense, the convergence guarantees are “sharp”.
• Computer Science, Mathematics
AISTATS
• 2019
This paper theoretically analyzes, in the convex and non-convex settings, a generalized version of the AdaGrad step sizes, and shows sufficient conditions for these step sizes to achieve almost-sure asymptotic convergence of the gradients to zero, proving the first guarantee for generalized AdaGrad step sizes in the non-convex setting.
• Computer Science
ICLR
• 2022
It is proved that the iterates of mSGD are asymptotically convergent to a connected set of stationary points with probability one, which is more general than existing works on subsequence convergence or convergence of time averages.
• Computer Science
ArXiv
• 2020
A high-probability analysis for adaptive and momentum algorithms is presented, under weak assumptions on the function, stochastic gradients, and learning rates, and it is used to prove for the first time the convergence of the gradients to zero in high probability in the smooth non-convex setting for Delayed AdaGrad with momentum.
• Computer Science
J. Mach. Learn. Res.
• 2011
This work describes and analyzes an apparatus for adaptively modifying the proximal function, which significantly simplifies setting a learning rate and results in regret guarantees that are provably as good as those of the best proximal function that could be chosen in hindsight.
• Computer Science
ArXiv
• 2021
This paper analyzes various stochastic methods (existing or newly proposed) based on the variance-recursion property of SEMA for three families of non-convex optimization: standard stochastic non-convex minimization, stochastic non-convex strongly-concave min-max optimization, and stochastic bilevel optimization.
• Computer Science
ICLR
• 2019
A set of mild sufficient conditions are provided that guarantee the convergence for the Adam-type methods and it is proved that under these derived conditions, these methods can achieve the convergence rate of order $O(\log{T}/\sqrt{T})$ for nonconvex stochastic optimization.
• Computer Science, Mathematics
ArXiv
• 2020
This paper studies some asymptotic properties of adaptive algorithms widely used in optimization and machine learning, among them AdaGrad and RMSProp, which are involved in most of the blackbox…
• Computer Science
ArXiv
• 2018
A sharp analysis of a recently proposed adaptive gradient method, namely the partially adaptive momentum estimation method (Padam) (Chen and Gu, 2018), which admits many existing adaptive gradient methods such as RMSProp and AMSGrad as special cases, shows that Padam converges to a first-order stationary point at a rate of O(…).
• Computer Science
• 2018
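The J. Mach. Learn. Res. 2011 reference above describes the original (diagonal) AdaGrad, where each coordinate keeps its own accumulator of squared gradients and thus its own effective step size. A minimal sketch of that per-coordinate update, on an assumed toy quadratic (`eta`, `eps`, and the objective are illustrative choices, not from the paper):

```python
import numpy as np

def adagrad_step(x, g, accum, eta=0.5, eps=1e-8):
    """One diagonal-AdaGrad step: scale each coordinate of the gradient
    by the inverse root of its own accumulated squared gradients."""
    accum = accum + g * g  # per-coordinate sum of g_i^2 (diagonal of G_k)
    x_new = x - eta * g / (np.sqrt(accum) + eps)
    return x_new, accum

# Toy objective f(x) = 0.5 * ||x||^2, so grad f(x) = x (assumed example).
x = np.array([5.0, -3.0])
accum = np.zeros_like(x)
for _ in range(2000):
    g = x
    x, accum = adagrad_step(x, g, accum)
print(x)  # both coordinates shrink in magnitude toward the minimizer at 0
```

The point of the per-coordinate scaling is that frequently-large coordinates get smaller effective steps while rarely-updated ones keep larger steps, which is what "simplifies setting a learning rate" refers to.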