• Corpus ID: 3758728

Variants of RMSProp and Adagrad with Logarithmic Regret Bounds

Mahesh Chandra Mukkamala, Matthias Hein
Adaptive gradient methods have recently become very popular, in particular because they have been shown to be useful in training deep neural networks. […] Moreover, we propose two variants, SC-Adagrad and SC-RMSProp, for which we show logarithmic regret bounds for strongly convex functions. Finally, we demonstrate in the experiments that these new variants outperform other adaptive gradient techniques or stochastic gradient descent in the optimization of strongly convex functions as well as in…

An affirmative answer is given by developing SAdam, a variant of Adam that achieves a data-dependent O(log T) regret bound for strongly convex functions. Under a special configuration of hyperparameters, SAdam reduces to SC-RMSprop, a recently proposed variant of RMSprop for strongly convex functions, for which it provides the first data-dependent logarithmic regret bound.

Convergence Guarantees for RMSProp and ADAM in Non-Convex Optimization and an Empirical Comparison to Nesterov Acceleration

This work provides proofs that these adaptive gradient algorithms are guaranteed to reach criticality for smooth non-convex objectives, and gives bounds on the running time of these algorithms.

SAdam: A Variant of Adam for Strongly Convex Functions

An affirmative answer is given by developing a variant of Adam which achieves a data-dependent logarithmic regret bound for strongly convex functions and reduces to SC-RMSprop, a recently proposed variant of RMSprop for strongly convex functions.

AdaLoss: A computationally-efficient and provably convergent adaptive gradient method

We propose a computationally friendly adaptive learning rate schedule, "AdaLoss", which directly uses the information of the loss function to adjust the step size in gradient descent methods. We…

A Sufficient Condition for Convergences of Adam and RMSProp

An alternative easy-to-check sufficient condition is introduced, which merely depends on the parameters of the base learning rate and combinations of historical second-order moments, to guarantee the global convergence of generic Adam/RMSProp for solving large-scale non-convex stochastic optimization.

Training Deep Neural Networks via Branch-and-Bound

BPGrad is a novel approximate algorithm for deep neural network training, based on adaptive estimates of the feasible region via branch-and-bound under the assumption of Lipschitz continuity of the objective function, and it is proved that it can achieve the optimal solution within finite iterations.

Closing the Generalization Gap of Adaptive Gradient Methods in Training Deep Neural Networks

This work designs a new algorithm, the partially adaptive momentum estimation method, which unifies Adam/Amsgrad with SGD by introducing a partial adaptive parameter $p$ to achieve the best of both worlds.

Adam revisited: a weighted past gradients perspective

It is proved that WADA can achieve a weighted data-dependent regret bound that could be better than the original regret bound of ADAGRAD when the gradients decrease rapidly; this may partially explain the good performance of ADAM in practice.

On the Convergence of Weighted AdaGrad with Momentum for Training Deep Neural Networks

Two new adaptive stochastic gradient methods, AdaHB and AdaNAG, are proposed; they integrate a novel weighted coordinate-wise AdaGrad with heavy ball momentum and Nesterov accelerated gradient momentum, respectively.

Towards Practical Adam: Non-Convexity, Convergence Theory, and Mini-Batch Acceleration

This work introduces an alternative easy-to-check sufficient condition, which merely depends on the parameters of the base learning rate and combinations of historical second-order moments, to guarantee the global convergence of generic Adam for solving large-scale non-convex stochastic optimization.



Adaptive Subgradient Methods for Online Learning and Stochastic Optimization

This work describes and analyzes an apparatus for adaptively modifying the proximal function, which significantly simplifies setting a learning rate and results in regret guarantees that are provably as good as the best proximal function that can be chosen in hindsight.

Improving Stochastic Gradient Descent with Feedback

A simple and efficient method for improving stochastic gradient descent methods by using feedback from the objective function, which is specifically applied to modify Adam, a popular algorithm for training deep neural networks.

Equilibrated adaptive learning rates for non-convex optimization

A novel adaptive learning rate scheme, called ESGD, based on the equilibration preconditioner is introduced, and experiments show that ESGD performs as well or better than RMSProp in terms of convergence speed, always clearly improving over plain stochastic gradient descent.

Adam: A Method for Stochastic Optimization

This work introduces Adam, an algorithm for first-order gradient-based optimization of stochastic objective functions, based on adaptive estimates of lower-order moments, and provides a regret bound on the convergence rate that is comparable to the best known results under the online convex optimization framework.

A Unified Approach to Adaptive Regularization in Online and Stochastic Optimization

This framework captures and unifies much of the existing literature on adaptive online methods, including the AdaGrad and Online Newton Step algorithms as well as their diagonal versions, and obtains new convergence proofs for these algorithms that are substantially simpler than previous analyses.

Deep learning in neural networks: An overview

Logarithmic regret algorithms for online convex optimization

Several algorithms achieving logarithmic regret are proposed, which besides being more general are also much more efficient to implement, and give rise to an efficient algorithm based on the Newton method for optimization, a new tool in the field.

Dropout: a simple way to prevent neural networks from overfitting

It is shown that dropout improves the performance of neural networks on supervised learning tasks in vision, speech recognition, document classification and computational biology, obtaining state-of-the-art results on many benchmark data sets.

Unit Tests for Stochastic Optimization

A collection of unit tests for stochastic optimization that rapidly evaluates an optimization algorithm on a small-scale, isolated, and well-understood difficulty, rather than in real-world scenarios where many such issues are entangled.

ADADELTA: An Adaptive Learning Rate Method

We present a novel per-dimension learning rate method for gradient descent called ADADELTA. The method dynamically adapts over time using only first-order information and has minimal computational…