• Corpus ID: 246822424

The Power of Adaptivity in SGD: Self-Tuning Step Sizes with Unbounded Gradients and Affine Variance

  title={The Power of Adaptivity in SGD: Self-Tuning Step Sizes with Unbounded Gradients and Affine Variance},
  author={Matthew Faw and Isidoros Tziotis and Constantine Caramanis and Aryan Mokhtari and Sanjay Shakkottai and Rachel A. Ward},
  booktitle={Annual Conference Computational Learning Theory},
We study convergence rates of AdaGrad-Norm as an exemplar of adaptive stochastic gradient methods (SGD), where the step sizes change based on observed stochastic gradients, for minimizing non-convex, smooth objectives. Despite their popularity, the analysis of adaptive SGD lags behind that of non-adaptive methods in this setting. Specifically, all prior works rely on some subset of the following assumptions: (i) uniformly-bounded gradient norms, (ii) uniformly-bounded stochastic gradient… 
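
To make "step sizes change based on observed stochastic gradients" concrete, here is a minimal sketch of the commonly stated AdaGrad-Norm update, b_{t+1}^2 = b_t^2 + ||g_t||^2 and x_{t+1} = x_t - (eta / b_{t+1}) g_t. It is an illustration, not code from the paper: the function names, the toy objective f(x) = 0.5 ||x||^2, the Gaussian noise model, and the default values of `eta` and `b0` are all assumptions chosen for the example.

```python
import math
import random


def adagrad_norm(grad_fn, x0, eta=1.0, b0=0.1, steps=1000):
    """Run AdaGrad-Norm: a single scalar step size shared by all coordinates.

    grad_fn(x) returns a (possibly noisy) gradient as a list of floats.
    eta and b0 are illustrative defaults, not values taken from the paper.
    """
    x = list(x0)
    b_sq = b0 ** 2  # running sum of squared gradient norms, seeded with b0^2
    for _ in range(steps):
        g = grad_fn(x)
        b_sq += sum(gi * gi for gi in g)  # accumulate ||g_t||^2
        step = eta / math.sqrt(b_sq)      # step size shrinks as gradients accumulate
        x = [xi - step * gi for xi, gi in zip(x, g)]
    return x


def noisy_quadratic_grad(x, rng, sigma=0.1):
    """Hypothetical stochastic gradient of f(x) = 0.5 * ||x||^2.

    The true gradient is x; independent Gaussian noise of scale sigma is added
    to each coordinate (an assumed noise model for this toy example).
    """
    return [xi + rng.gauss(0.0, sigma) for xi in x]


if __name__ == "__main__":
    rng = random.Random(0)
    x_final = adagrad_norm(lambda v: noisy_quadratic_grad(v, rng), [5.0, -3.0], steps=2000)
    print("final distance to the minimizer:", math.sqrt(sum(c * c for c in x_final)))
```

No step-size schedule is tuned by hand here: the effective step size `eta / sqrt(b_sq)` is determined entirely by the gradients seen so far, which is the self-tuning behavior the title refers to.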

Adaptive Stochastic Variance Reduction for Non-convex Finite-Sum Minimization

AdaSpider is proposed as the first parameter-free non-convex variance-reduction method, in the sense that it does not require knowledge of problem-dependent parameters such as the smoothness constant L, the target accuracy ε, or any bound on gradient norms.

Provable Adaptivity in Adam

It is argued that Adam can adapt to the local smoothness condition, which justifies the adaptivity of Adam and sheds light on the benefit of adaptive gradient methods over non-adaptive ones.

A Simple Convergence Proof of Adam and Adagrad

This work provides a simple proof of convergence covering both the Adam and Adagrad adaptive optimization algorithms when applied to smooth (possibly non-convex) objective functions with bounded gradients and obtains the tightest dependency on the heavy ball momentum among all previous convergence bounds.



AdaGrad stepsizes: sharp convergence over nonconvex landscapes

The norm version of AdaGrad (AdaGrad-Norm) converges to a stationary point at the O(log(N)/√N) rate in the stochastic setting, and at the optimal O(1/N) rate in the batch (non-stochastic) setting; in this sense, the convergence guarantees are "sharp".

On the Convergence of Stochastic Gradient Descent with Adaptive Stepsizes

This paper theoretically analyzes a generalized version of the AdaGrad stepsizes in the convex and non-convex settings and shows sufficient conditions for these stepsizes to achieve almost sure asymptotic convergence of the gradients to zero, proving the first guarantee for generalized AdaGrad stepsizes in the non-convex setting.

On the Convergence of mSGD and AdaGrad for Stochastic Optimization

It is proved that the iterates of mSGD are asymptotically convergent to a connected set of stationary points with probability one, which is more general than existing works on subsequence convergence or convergence of time averages.

A High Probability Analysis of Adaptive SGD with Momentum

A high probability analysis for adaptive and momentum algorithms, under weak assumptions on the function, stochastic gradients, and learning rates, is presented and used to prove, for the first time, the convergence of the gradients to zero in high probability in the smooth nonconvex setting for Delayed AdaGrad with momentum.

Adaptive Subgradient Methods for Online Learning and Stochastic Optimization

This work describes and analyzes an apparatus for adaptively modifying the proximal function, which significantly simplifies setting a learning rate and results in regret guarantees that are provably as good as the best proximal functions that can be chosen in hindsight.

On Stochastic Moving-Average Estimators for Non-Convex Optimization

This paper analyzes various stochastic methods (existing or newly proposed) based on the variance recursion property of SEMA for three families of non-convex optimization, namely standard stochastic non-convex minimization, stochastic non-convex strongly-concave min-max optimization, and stochastic bilevel optimization.

On the Convergence of A Class of Adam-Type Algorithms for Non-Convex Optimization

A set of mild sufficient conditions is provided that guarantees convergence of the Adam-type methods, and it is proved that under these conditions the methods achieve a convergence rate of order $O(\log{T}/\sqrt{T})$ for nonconvex stochastic optimization.

Asymptotic study of stochastic adaptive algorithm in non-convex landscape

This paper studies some asymptotic properties of adaptive algorithms widely used in optimization and machine learning, among them Adagrad and Rmsprop, which are involved in most of the blackbox…

On the Convergence of Adaptive Gradient Methods for Nonconvex Optimization

A sharp analysis of a recently proposed adaptive gradient method, the partially adaptive momentum estimation method (Padam) (Chen and Gu, 2018), which admits many existing adaptive gradient methods such as RMSProp and AMSGrad as special cases, shows that Padam converges to a first-order stationary point at the rate of O\big.

Weighted AdaGrad with Unified Momentum

AdaUSM is proposed, which has the main characteristics that it incorporates a unified momentum scheme which covers both the heavy ball momentum and the Nesterov accelerated gradient momentum, and adopts a novel weighted adaptive learning rate that can unify the learning rates of AdaGrad, AccAdaGrad, Adam, and RMSProp.