Corpus ID: 182952664

Reducing the variance in online optimization by transporting past gradients

@inproceedings{Arnold2019ReducingTV,
  title={Reducing the variance in online optimization by transporting past gradients},
  author={S{\'e}bastien M. R. Arnold and Pierre-Antoine Manzagol and Reza Babanezhad and Ioannis Mitliagkas and Nicolas Le Roux},
  booktitle={NeurIPS},
  year={2019}
}
Most stochastic optimization methods use gradients once before discarding them. [...] In addition to reducing the variance and bias of our updates over time, IGT can be used as a drop-in replacement for the gradient estimate in a number of well-understood methods such as heavy ball or Adam. We show experimentally that it achieves state-of-the-art results on a wide range of architectures and benchmarks. Additionally, the IGT gradient estimator yields the optimal asymptotic convergence rate for online…
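Based only on the abstract above, here is a minimal NumPy sketch of how an IGT-style estimator could act as a drop-in replacement for the stochastic gradient inside heavy ball. The function name heavyball_igt, the uniform-averaging schedule gamma_t = t/(t+1), and the extrapolation factor gamma_t/(1 - gamma_t) are illustrative assumptions, not the authors' reference implementation.

```python
import numpy as np

def heavyball_igt(grad_fn, theta0, lr=0.1, momentum=0.9, steps=100):
    """grad_fn(theta) returns a stochastic gradient estimate at theta.

    Sketch of heavy ball with an IGT-style averaged gradient estimate
    (schedule and extrapolation factor are assumptions, see lead-in).
    """
    theta = np.asarray(theta0, dtype=float).copy()
    theta_prev = theta.copy()
    g = np.zeros_like(theta)   # running average of "transported" past gradients
    v = np.zeros_like(theta)   # heavy-ball velocity
    for t in range(steps):
        gamma = t / (t + 1.0)  # assumed uniform-averaging schedule
        # Evaluate the fresh gradient at an extrapolated point; note that
        # gamma / (1 - gamma) == t, so the extrapolation grows with t.
        lookahead = theta + (gamma / (1.0 - gamma)) * (theta - theta_prev)
        g = gamma * g + (1.0 - gamma) * grad_fn(lookahead)
        # Heavy-ball step, with the IGT estimate replacing the raw gradient.
        v = momentum * v - lr * g
        theta_prev = theta.copy()
        theta = theta + v
    return theta

# Toy usage on a noisy quadratic f(theta) = 0.5 * ||theta||^2, whose true
# gradient is theta; Gaussian noise makes the oracle stochastic.
rng = np.random.default_rng(0)

def noisy_grad(theta):
    return theta + 0.1 * rng.standard_normal(theta.shape)

print(heavyball_igt(noisy_grad, np.ones(5), lr=0.05, momentum=0.9, steps=500))
```

The same running estimate g could equally be fed to Adam in place of the raw minibatch gradient, which is the drop-in property the abstract emphasizes.

Citations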
Self-Tuning Stochastic Optimization with Curvature-Aware Gradient Filtering
TLDR: It is proved that the model-based procedure converges in the noisy quadratic setting and can match the performance of well-tuned optimizers; ultimately, this is an interesting step toward constructing self-tuning optimizers.
Variance reduced optimization with implicit gradient transport
TLDR: The modified implicit gradient transport (MIGT) technique is introduced into the classical stochastic variance reduced method SVRG, leading to a new method called SVRG-MIGT, which enjoys second-order information for faster variance reduction without computing the Hessian explicitly.
AG-SGD: Angle-Based Stochastic Gradient Descent
TLDR: An algorithm is proposed that quantifies this deviation based on the angle between the past and the current gradients, which is then applied to calibrate these two gradients and generate a more accurate new gradient.
Understanding Accelerated Stochastic Gradient Descent via the Growth Condition
We study accelerated stochastic gradient descent through the lens of the growth condition. Stochastic gradient methods (SGD) with momentum, such as heavy ball (HB) and Nesterov's accelerated methods…
Variance Reduction in Deep Learning: More Momentum is All You Need
TLDR: The ubiquitous clustering structure of rich datasets used in deep learning is exploited to design a family of scalable variance-reduced optimization procedures by combining existing optimizers with a multi-momentum strategy (Yuan et al., 2019).
DEMON: Momentum Decay for Improved Neural Network Training
Momentum is a popular technique in deep learning for gradient-based optimizers. We propose a decaying momentum (DEMON) rule, motivated by decaying the total contribution of a gradient to all future…
High-probability Bounds for Non-Convex Stochastic Optimization with Heavy Tails
TLDR: It is shown that after a suitable “burn-in” period, the objective value will monotonically decrease whenever the current iterate is not a critical point, which provides intuition behind the popular practice of learning-rate “warm-up” and also yields a last-iterate guarantee.
ErrorCompensatedX: Error Compensation for Variance Reduced Algorithms
TLDR: ErrorCompensatedX is proposed, which uses the compression error from the previous two steps to achieve the same asymptotic convergence rate as training without compression, and a unified theoretical analysis framework is provided for this class of variance reduced algorithms, with or without error compensation.
Scaling Transition from Momentum Stochastic Gradient Descent to Plain Stochastic Gradient Descent
TLDR: The proposed TSGD algorithm has faster training speed, higher accuracy, and better stability; a learning rate that decreases linearly with the iterations is used instead of a constant learning rate.
Randomized Iterative Methods for Linear Systems: Momentum, Inexactness and Gossip
TLDR: This thesis is interested in the development of randomized iterative methods for solving large-scale linear systems, stochastic quadratic optimization problems, the best approximation problem, and quadratic optimization problems.

References

Showing 1–10 of 54 references
Variance Reduced Stochastic Gradient Descent with Neighbors
TLDR: This paper investigates algorithms that can exploit neighborhood structure in the training data to share and re-use information about past stochastic gradients across data points, which offers advantages in the transient optimization phase.
Adam: A Method for Stochastic Optimization
TLDR: This work introduces Adam, an algorithm for first-order gradient-based optimization of stochastic objective functions based on adaptive estimates of lower-order moments, and provides a regret bound on the convergence rate that is comparable to the best known results under the online convex optimization framework.
The Marginal Value of Adaptive Gradient Methods in Machine Learning
TLDR: It is observed that the solutions found by adaptive methods generalize worse (often significantly worse) than those found by SGD, even when the adaptive solutions have better training performance, suggesting that practitioners should reconsider the use of adaptive methods to train neural networks.
Parallelizing Stochastic Gradient Descent for Least Squares Regression: Mini-batching, Averaging, and Model Misspecification
TLDR: A novel analysis is developed for bounding these operators to characterize the excess risk of communication-efficient parallelization schemes such as model-averaging/parameter-mixing methods, which are of broader interest in analyzing computational aspects of stochastic approximation.
YellowFin and the Art of Momentum Tuning
TLDR: This work revisits the momentum SGD algorithm and shows that hand-tuning a single learning rate and momentum makes it competitive with Adam, and designs YellowFin, an automatic tuner for momentum and learning rate in SGD.
Harder, Better, Faster, Stronger Convergence Rates for Least-Squares Regression
We consider the optimization of a quadratic objective function whose gradients are only accessible through a stochastic oracle that returns the gradient at any given point plus a zero-mean finite…
Stop Wasting My Gradients: Practical SVRG
TLDR: This work shows how to exploit support vectors to reduce the number of gradient computations in the later iterations of stochastic variance-reduced gradient methods and proves that the commonly used regularized SVRG iteration is justified and improves the convergence rate.
On the Importance of Initialization and Momentum in Deep Learning
TLDR: It is shown that when stochastic gradient descent with momentum uses a well-designed random initialization and a particular type of slowly increasing schedule for the momentum parameter, it can train both DNNs and RNNs to levels of performance that were previously achievable only with Hessian-free optimization.
Linear Convergence with Condition Number Independent Access of Full Gradients
TLDR: This paper proposes to remove the dependence on the condition number by allowing the algorithm to access stochastic gradients of the objective function, and presents a novel algorithm named Epoch Mixed Gradient Descent (EMGD) that is able to utilize two kinds of gradients.
Fast and Faster Convergence of SGD for Over-Parameterized Models and an Accelerated Perceptron
TLDR: It is proved that constant step-size stochastic gradient descent (SGD) with Nesterov acceleration matches the convergence rate of the deterministic accelerated method for both convex and strongly convex functions.