• Corpus ID: 182952664

Reducing the variance in online optimization by transporting past gradients

Sébastien M. R. Arnold, Pierre-Antoine Manzagol, Reza Babanezhad, Ioannis Mitliagkas, Nicolas Le Roux
Most stochastic optimization methods use gradients once before discarding them. […] In addition to reducing the variance and bias of our updates over time, IGT can be used as a drop-in replacement for the gradient estimate in a number of well-understood methods such as heavy ball or Adam. We show experimentally that it achieves state-of-the-art results on a wide range of architectures and benchmarks. Additionally, the IGT gradient estimator yields the optimal asymptotic convergence rate for online…
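The abstract excerpt above does not state the IGT update itself, so the following is only a minimal sketch under an assumed form: the running estimate is v_t = γ_t v_{t-1} + (1 − γ_t) g(θ_t + γ_t/(1 − γ_t) (θ_t − θ_{t-1})) with γ_t = t/(t+1), where IGT stands for implicit gradient transport. The function name `igt_gradient_descent`, the scalar parameter, and the plain gradient-descent outer loop are illustrative choices, not taken from the page.

```python
def igt_gradient_descent(grad, theta0, lr, steps):
    """Gradient descent driven by an IGT-style running gradient estimate.

    grad(theta, t) returns a (possibly noisy) gradient sample at theta.
    Returns (iterates, estimates); estimates[t] is the estimate used at iterates[t].
    The update form and the gamma schedule are assumptions, see the lead-in.
    """
    theta_prev = theta0
    theta = theta0
    v = 0.0
    thetas, estimates = [theta0], []
    for t in range(steps):
        gamma = t / (t + 1.0)              # assumed averaging weight
        shift = t * (theta - theta_prev)   # gamma / (1 - gamma) equals t
        g = grad(theta + shift, t)         # sample gradient at the extrapolated point
        v = gamma * v + (1.0 - gamma) * g  # average of past and new gradients
        estimates.append(v)
        theta_prev, theta = theta, theta - lr * v
        thetas.append(theta)
    return thetas, estimates


# Noiseless quadratic f(theta) = 0.5 * a * theta**2 - b * theta.
a, b = 2.0, 1.0
thetas, estimates = igt_gradient_descent(lambda th, t: a * th - b, 3.0, 0.1, 20)
```

On a noiseless quadratic like this one, the averaged estimate coincides with the exact gradient at the current iterate, which is the sense in which evaluating at the shifted point "transports" older gradients forward. Replacing the last line of the loop with a heavy-ball or Adam step would use `v` as the drop-in gradient estimate the abstract mentions.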

Self-Tuning Stochastic Optimization with Curvature-Aware Gradient Filtering

It is proved that the model-based procedure converges in the noisy quadratic setting and can match the performance of well-tuned optimizers, making it an interesting step toward constructing self-tuning optimizers.

AG-SGD: Angle-Based Stochastic Gradient Descent

An algorithm is proposed that quantifies this deviation based on the angle between the past and the current gradients, which is then used to calibrate these two gradients and generate a more accurate new gradient.

Understanding Accelerated Stochastic Gradient Descent via the Growth Condition

A trade-off between robustness and convergence rate is established, which shows that even though simple accelerated methods like HB and NAM are optimal in the deterministic case, more sophisticated design of algorithms leads to robustness in stochastic settings and achieves a better convergence rate than vanilla SGD.

Momentum Improves Normalized SGD

An improved analysis of normalized SGD is provided showing that adding momentum provably removes the need for large batch sizes on non-convex objectives and an adaptive method is provided that automatically improves convergence rates when the variance in the gradients is small.
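As a rough illustration of what "normalized SGD with momentum" can look like, the sketch below averages gradients into a momentum buffer and then steps along the unit-length direction of that buffer. The exact update form, the function name `normalized_momentum_step`, and the default hyperparameters are assumptions for illustration; the summary above does not spell them out.

```python
import math


def normalized_momentum_step(theta, m, g, lr=0.1, beta=0.9):
    """One step of momentum-averaged, normalized SGD (assumed form).

    theta, m, g are equal-length lists of floats: parameters, momentum
    buffer, and the current stochastic gradient.
    Returns the updated (theta, m).
    """
    # Exponential moving average of gradients.
    m_new = [beta * mi + (1.0 - beta) * gi for mi, gi in zip(m, g)]
    norm = math.sqrt(sum(mi * mi for mi in m_new))
    if norm == 0.0:
        # No direction available; leave the parameters unchanged.
        return list(theta), m_new
    # Step of fixed length lr along the normalized momentum direction.
    theta_new = [ti - lr * mi / norm for ti, mi in zip(theta, m_new)]
    return theta_new, m_new


theta, m = normalized_momentum_step([1.0, 2.0], [0.0, 0.0], [3.0, 4.0])
```

Because the step is normalized, its length is always `lr` regardless of the gradient scale; only the direction depends on the averaged gradients.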

Variance Reduction in Deep Learning: More Momentum is All You Need

The ubiquitous clustering structure of rich datasets used in deep learning is exploited to design a family of scalable variance reduced optimization procedures by combining existing optimizers with a multi-momentum strategy (Yuan et al., 2019).


This work proposes a decaying momentum (DEMON) rule, motivated by decaying the total contribution of a gradient to all future updates, which leads to significantly improved training and is notably competitive with momentum SGD with learning rate decay, even in settings in which adaptive methods are typically non-competitive.
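One way to read "decaying the total contribution of a gradient to all future updates" is as a schedule for the momentum coefficient. The sketch below assumes the schedule β_t = β_init (1 − t/T) / ((1 − β_init) + β_init (1 − t/T)); this formula, the function name `demon_momentum`, and the default `beta_init` are assumptions for illustration and are not stated in the summary above.

```python
def demon_momentum(t, total_steps, beta_init=0.9):
    """Momentum coefficient at step t under an assumed DEMON-style schedule.

    The coefficient starts at beta_init when t == 0 and decays to 0
    when t == total_steps.
    """
    remaining = 1.0 - t / total_steps  # fraction of training still to go
    return beta_init * remaining / ((1.0 - beta_init) + beta_init * remaining)


# Momentum values at the start, middle, and end of a 100-step run.
schedule = [demon_momentum(t, 100) for t in (0, 50, 100)]
```

The value would be recomputed each step and passed to the momentum update in place of a fixed coefficient.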

ROOT-SGD: Sharp Nonasymptotics and Asymptotic Efficiency in a Single Algorithm

This work considers first-order stochastic optimization from a general statistical point of view, motivating a specific form of recursive averaging of past stochastic gradients, and concludes that the resulting algorithm, which is referred to as ROOT-SGD, matches the state-of-the-art convergence rate among online variance-reduced stochastic approximation methods.

Demon: Improved Neural Network Training With Momentum Decay

DEMON consistently outperforms other widely-used schedulers including, but not limited to, the learning rate step schedule, linear schedule, OneCycle schedule, and exponential schedule and is less sensitive to parameter tuning, which is critical to training neural networks in practice.

High-probability Bounds for Non-Convex Stochastic Optimization with Heavy Tails

It is shown that after a suitable “burn-in” period, the objective value will monotonically decrease whenever the current iterate is not a critical point, which provides intuition behind the popular practice of learning rate “warm-up” and also yields a last-iterate guarantee.



Adam: A Method for Stochastic Optimization

This work introduces Adam, an algorithm for first-order gradient-based optimization of stochastic objective functions, based on adaptive estimates of lower-order moments, and provides a regret bound on the convergence rate that is comparable to the best known results under the online convex optimization framework.
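For readers who want the "adaptive estimates of lower-order moments" made concrete, here is a minimal sketch of a single Adam step on a list of parameters, using the commonly cited defaults (learning rate 1e-3, β1 = 0.9, β2 = 0.999, ε = 1e-8). The function name `adam_step` and the list-based layout are illustrative choices; this is not a replacement for a library optimizer.

```python
import math


def adam_step(theta, m, v, g, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update. t is the 1-based step count.

    theta, m, v, g are equal-length lists: parameters, first-moment
    estimate, second-moment estimate, and the current gradient.
    Returns the updated (theta, m, v).
    """
    m = [beta1 * mi + (1.0 - beta1) * gi for mi, gi in zip(m, g)]
    v = [beta2 * vi + (1.0 - beta2) * gi * gi for vi, gi in zip(v, g)]
    # Bias correction compensates for initializing m and v at zero.
    m_hat = [mi / (1.0 - beta1 ** t) for mi in m]
    v_hat = [vi / (1.0 - beta2 ** t) for vi in v]
    theta = [ti - lr * mh / (math.sqrt(vh) + eps)
             for ti, mh, vh in zip(theta, m_hat, v_hat)]
    return theta, m, v


theta, m, v = adam_step([1.0, -2.0], [0.0, 0.0], [0.0, 0.0], [0.5, -3.0], t=1)
```

On the first step the bias-corrected ratio is close to the sign of the gradient, so each parameter moves by roughly `lr` regardless of the gradient's magnitude.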

Variance Reduced Stochastic Gradient Descent with Neighbors

This paper investigates algorithms that can exploit neighborhood structure in the training data to share and re-use information about past stochastic gradients across data points, which offers advantages in the transient optimization phase.

The Marginal Value of Adaptive Gradient Methods in Machine Learning

It is observed that the solutions found by adaptive methods generalize worse (often significantly worse) than SGD, even when these solutions have better training performance, suggesting that practitioners should reconsider the use of adaptive methods to train neural networks.

Parallelizing Stochastic Gradient Descent for Least Squares Regression: Mini-batching, Averaging, and Model Misspecification

A novel analysis is developed in bounding these operators to characterize the excess risk of communication efficient parallelization schemes such as model-averaging/parameter mixing methods, which are of broader interest in analyzing computational aspects of stochastic approximation.

YellowFin and the Art of Momentum Tuning

This work revisits the momentum SGD algorithm and shows that hand-tuning a single learning rate and momentum makes it competitive with Adam, and designs YellowFin, an automatic tuner for momentum and learning rate in SGD.

Harder, Better, Faster, Stronger Convergence Rates for Least-Squares Regression

We consider the optimization of a quadratic objective function whose gradients are only accessible through a stochastic oracle that returns the gradient at any given point plus a zero-mean finite

Stop Wasting My Gradients: Practical SVRG

This work shows how to exploit support vectors to reduce the number of gradient computations in the later iterations of stochastic variance-reduced gradient methods and proves that the commonly-used regularized SVRG iteration is justified and improves the convergence rate.

On the importance of initialization and momentum in deep learning

It is shown that when stochastic gradient descent with momentum uses a well-designed random initialization and a particular type of slowly increasing schedule for the momentum parameter, it can train both DNNs and RNNs to levels of performance that were previously achievable only with Hessian-Free optimization.

Linear Convergence with Condition Number Independent Access of Full Gradients

This paper proposes to remove the dependence on the condition number by allowing the algorithm to access stochastic gradients of the objective function, and presents a novel algorithm named Epoch Mixed Gradient Descent (EMGD) that is able to utilize two kinds of gradients.

Fast and Faster Convergence of SGD for Over-Parameterized Models and an Accelerated Perceptron

It is proved that constant step-size stochastic gradient descent (SGD) with Nesterov acceleration matches the convergence rate of the deterministic accelerated method for both convex and strongly-convex functions.