# Speed learning on the fly

@article{Mass2015SpeedLO,
  title   = {Speed learning on the fly},
  author  = {Pierre Mass{\'e} and Yann Ollivier},
  journal = {ArXiv},
  volume  = {abs/1511.02540},
  year    = {2015}
}

The practical performance of online stochastic gradient descent algorithms is highly dependent on the chosen step size, which must be tediously hand-tuned in many applications. The same is true for more advanced variants of stochastic gradients, such as SAGA, SVRG, or AdaGrad. Here we propose to adapt the step size by performing a gradient descent on the step size itself, viewing the whole performance of the learning trajectory as a function of step size. Importantly, this adaptation can be…
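The general idea of taking a gradient step on the step size itself can be illustrated with a one-step hypergradient approximation. This is a simplified sketch, not the paper's exact algorithm (which differentiates the whole learning trajectory): here the derivative of the loss with respect to the step size is approximated by the inner product of two consecutive gradients, and the function names are illustrative.

```python
import numpy as np

def sgd_with_hypergradient(grad, w, eta=0.01, beta=1e-4, steps=100):
    """SGD whose step size eta is itself adapted by gradient descent.

    One-step hypergradient approximation: d(loss)/d(eta) at step t is
    roughly -g_t . g_{t-1}, so eta grows when consecutive gradients
    align and shrinks when they oppose each other.
    """
    g_prev = np.zeros_like(w)
    for _ in range(steps):
        g = grad(w)
        eta += beta * g.dot(g_prev)   # gradient step on the step size
        w = w - eta * g               # ordinary SGD step with adapted eta
        g_prev = g
    return w, eta

# Usage: minimise f(w) = ||w||^2 / 2, whose gradient is w itself.
w, eta = sgd_with_hypergradient(lambda w: w, np.array([1.0, -2.0]))
```

Since successive gradients point in the same direction on this objective, the adapted step size ends slightly above its initial value of 0.01.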


#### 9 Citations

Learning with Random Learning Rates

- Computer Science, Mathematics
- ECML/PKDD
- 2019

Alrao could save time when testing deep learning models: a range of models could be quickly assessed with Alrao, and the most promising models could then be trained more extensively.

Autour de L'Usage des gradients en apprentissage statistique. (Around the Use of Gradients in Machine Learning)

- Computer Science
- 2017

A local convergence theorem is proved for the classical dynamical-system optimisation algorithm RTRL in a non-linear setting, and the LLR algorithm is formalised by replacing this information with an unbiased, low-dimensional random approximation.

All Learning Rates At Once: Description

- 2018

Hyperparameter tuning is a bothersome step in the training of deep learning models. One of the most sensitive hyperparameters is the learning rate of the gradient descent. We present the All Learning…

Barzilai-Borwein Step Size for Stochastic Gradient Descent

- Computer Science, Mathematics
- NIPS
- 2016

The Barzilai-Borwein (BB) method is proposed as a way to automatically compute step sizes for SGD and its variant, the stochastic variance reduced gradient (SVRG) method, yielding two algorithms, SGD-BB and SVRG-BB, whose performance is superior to that of some advanced SGD variants.
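The BB step size divides the squared norm of the parameter difference by its inner product with the gradient difference. A minimal deterministic sketch (SGD-BB applies the same formula to per-epoch averages rather than full gradients; the function name and tolerance are illustrative):

```python
import numpy as np

def gd_bb(grad, x, eta0=0.1, steps=20):
    """Gradient descent with Barzilai-Borwein (BB1) step sizes."""
    g = grad(x)
    x_new = x - eta0 * g              # one bootstrap step with a fixed size
    for _ in range(steps):
        g_new = grad(x_new)
        s, y = x_new - x, g_new - g   # parameter and gradient differences
        sy = s.dot(y)
        if abs(sy) < 1e-12:           # gradient no longer changing: done
            break
        eta = s.dot(s) / sy           # BB1 step size ||s||^2 / <s, y>
        x, g = x_new, g_new
        x_new = x - eta * g
    return x_new

# Quadratic f(x) = x . (A x) / 2 with A = diag(1, 10); gradient is A x.
A = np.array([1.0, 10.0])
x = gd_bb(lambda x: A * x, np.array([1.0, 1.0]))
```

On a quadratic, the BB step size approximates an inverse curvature along the most recent direction, which is why no hand-tuned schedule is needed after the bootstrap step.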

Nesterov's accelerated gradient and momentum as approximations to regularised update descent

- Mathematics, Computer Science
- 2017 International Joint Conference on Neural Networks (IJCNN)
- 2017

It is shown that a new algorithm, termed Regularised Gradient Descent, can converge more quickly than either Nesterov's algorithm or the classical momentum algorithm.

Variance-based stochastic extragradient methods with line search for stochastic variational inequalities

- Mathematics
- 2017

A dynamic sampled stochastic approximated (DS-SA) extragradient method for stochastic variational inequalities (SVI) is proposed that is *robust* with respect to an unknown Lipschitz constant…

Data-driven algorithm selection and tuning in optimization and signal processing

- Computer Science
- Ann. Math. Artif. Intell.
- 2021

The goal is to train machine learning methods to automatically improve the performance of optimization and signal processing algorithms. It is shown that there exists a learning algorithm that, with high probability, selects the algorithm optimizing the average performance on an input set of problem instances drawn from a given distribution.

Data-driven Algorithm Selection and Parameter Tuning: Two Case studies in Optimization and Signal Processing

- Computer Science, Mathematics
- ArXiv
- 2019

The goal is to train machine learning methods to automatically improve the performance of optimization and signal processing algorithms. This approach is used to improve two popular data processing subroutines in data science: stochastic gradient descent and greedy methods in compressed sensing.

Autour De L'Usage des gradients en apprentissage statistique

- Mathematics
- 2017

We establish a local convergence theorem for RTRL, the classical optimisation algorithm for dynamical systems, applied to a non-linear system. The RTRL algorithm is an algorithm…

#### References

Showing 1–10 of 11 references

No more pesky learning rates

- Mathematics, Computer Science
- ICML
- 2013

The proposed method automatically adjusts multiple learning rates so as to minimize the expected error at any one time; it relies on local gradient variations across samples, making it suitable for non-stationary problems.

Large-Scale Machine Learning with Stochastic Gradient Descent

- Computer Science
- COMPSTAT
- 2010

A more precise analysis uncovers qualitatively different tradeoffs for the case of small-scale and large-scale learning problems.

Gradient-based Hyperparameter Optimization through Reversible Learning

- Mathematics, Computer Science
- ICML
- 2015

This work computes exact gradients of cross-validation performance with respect to all hyperparameters by chaining derivatives backwards through the entire training procedure, which allows thousands of hyperparameters to be optimized, including step-size and momentum schedules, weight initialization distributions, richly parameterized regularization schemes, and neural network architectures.

Adaptive Subgradient Methods for Online Learning and Stochastic Optimization

- Computer Science, Mathematics
- J. Mach. Learn. Res.
- 2011

This work describes and analyzes an apparatus for adaptively modifying the proximal function, which significantly simplifies setting a learning rate and results in regret guarantees that are provably as good as the best proximal function that could have been chosen in hindsight.
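In its most common diagonal form, this adaptive scheme is the AdaGrad update: each coordinate's step size is divided by the square root of its accumulated squared gradients. A minimal sketch (the function name and test problem are illustrative):

```python
import numpy as np

def adagrad(grad, w, eta=0.5, eps=1e-8, steps=200):
    """AdaGrad: per-coordinate step sizes eta / sqrt(sum of squared past
    gradients), so coordinates with large gradients slow down faster."""
    G = np.zeros_like(w)                      # running sum of squared gradients
    for _ in range(steps):
        g = grad(w)
        G += g * g
        w = w - eta * g / (np.sqrt(G) + eps)  # eps guards against divide-by-zero
    return w

# Minimise a poorly scaled quadratic f(w) = (w0^2 + 100 w1^2) / 2.
w = adagrad(lambda w: np.array([w[0], 100.0 * w[1]]), np.array([1.0, 1.0]))
```

The per-coordinate normalization is what makes a single global `eta` usable across badly scaled coordinates, which is the practical sense in which setting a learning rate is "simplified".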

Accelerating Stochastic Gradient Descent using Predictive Variance Reduction

- Computer Science, Mathematics
- NIPS
- 2013

It is proved that this method enjoys the same fast convergence rate as those of stochastic dual coordinate ascent (SDCA) and Stochastic Average Gradient (SAG), but the analysis is significantly simpler and more intuitive.
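The method in question is SVRG: each epoch stores a snapshot of the parameters and its full gradient, then takes cheap inner steps with a variance-reduced gradient estimate. A minimal sketch on a toy finite sum (function names and constants are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

def svrg(grads, w, eta=0.05, epochs=10, m=50):
    """SVRG: inner steps use g_i(w) - g_i(w_s) + mu, an unbiased estimate
    whose variance shrinks as w approaches the snapshot w_s."""
    n = len(grads)
    for _ in range(epochs):
        w_s = w.copy()
        mu = sum(g(w_s) for g in grads) / n   # full gradient at the snapshot
        for _ in range(m):
            i = rng.integers(n)
            w = w - eta * (grads[i](w) - grads[i](w_s) + mu)
    return w

# Finite sum of 1-D quadratics f_i(w) = (w - c_i)^2 / 2, optimum at mean(c).
c = np.array([1.0, 2.0, 3.0, 4.0])
grads = [(lambda w, ci=ci: w - ci) for ci in c]
w = svrg(grads, np.array([0.0]))
```

Because the correction term cancels the per-example noise, a constant step size can be used, unlike plain SGD, which needs a decaying schedule to converge.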

Natural Gradient Works Efficiently in Learning

- Computer Science, Mathematics
- Neural Computation
- 1998

The dynamical behavior of natural gradient online learning is analyzed, and it is proved to be Fisher efficient, implying that it has asymptotically the same performance as the optimal batch estimation of parameters.

Minimizing finite sums with the stochastic average gradient

- Mathematics, Computer Science
- Math. Program.
- 2017

Numerical experiments indicate that the new SAG method often dramatically outperforms existing SG and deterministic gradient methods, and that the performance may be further improved through the use of non-uniform sampling strategies.

SAGA: A Fast Incremental Gradient Method With Support for Non-Strongly Convex Composite Objectives

- Computer Science, Mathematics
- NIPS
- 2014

This work introduces a new optimisation method called SAGA, which improves on the theory behind SAG and SVRG, with better theoretical convergence rates, and has support for composite objectives where a proximal operator is used on the regulariser.
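The core SAGA update keeps a table of the last gradient seen for each summand and uses it to build an unbiased, low-variance gradient estimate at the cost of one gradient evaluation per step. A minimal sketch of the smooth (non-proximal) case, on a toy finite sum (function names and constants are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

def saga(grads, w, eta=0.1, steps=500):
    """SAGA: step direction g_i(w) - table[i] + mean(table) is an unbiased
    gradient estimate whose variance vanishes near the optimum."""
    n = len(grads)
    table = np.array([g(w) for g in grads])  # stored per-example gradients
    avg = table.mean(axis=0)
    for _ in range(steps):
        i = rng.integers(n)
        g = grads[i](w)
        w = w - eta * (g - table[i] + avg)   # variance-reduced step
        avg += (g - table[i]) / n            # maintain the mean in O(d)
        table[i] = g
    return w

# Finite sum of 1-D quadratics f_i(w) = (w - c_i)^2 / 2, optimum at mean(c).
c = np.array([1.0, 2.0, 3.0, 4.0])
grads = [(lambda w, ci=ci: w - ci) for ci in c]
w = saga(grads, np.array([0.0]))
```

Unlike SVRG, SAGA needs no full-gradient snapshot passes, trading O(nd) memory for the gradient table; the composite case replaces the plain step with a proximal operator on the regulariser.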

A Stochastic Approximation Method

- Mathematics
- 2007

Let M(x) denote the expected value at level x of the response to a certain experiment. M(x) is assumed to be a monotone function of x but is unknown to the experimenter, and it is desired to find the…

Stochastic Approximation Methods for Constrained and Unconstrained Systems, volume 26 of Applied Mathematical Sciences

- 1978