• Corpus ID: 3508234

Online Learning Rate Adaptation with Hypergradient Descent

@article{Baydin2017OnlineLR,
  title={Online Learning Rate Adaptation with Hypergradient Descent},
  author={Atilim Gunes Baydin and Robert Cornish and David Mart{\'i}nez-Rubio and Mark W. Schmidt and Frank D. Wood},
  journal={ArXiv},
  year={2017},
  volume={abs/1703.04782}
}
We introduce a general method for improving the convergence rate of gradient-based optimizers that is easy to implement and works well in practice. We demonstrate the effectiveness of the method in a range of optimization problems by applying it to stochastic gradient descent, stochastic gradient descent with Nesterov momentum, and Adam, showing that it significantly reduces the need for the manual tuning of the initial learning rate for these commonly used algorithms. Our method works by… 
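
The update rule sketched in the abstract can be written in a few lines for plain SGD: the hypergradient of the objective with respect to the learning rate reduces to the (negative) dot product of the current and previous gradients, and the learning rate is nudged along that signal. Below is a minimal NumPy sketch under that reading; the function names, constants, and toy quadratic objective are ours, not the paper's.

```python
import numpy as np

def sgd_hd(grad_fn, theta, alpha=1e-3, beta=1e-4, steps=100):
    """Minimal sketch of SGD with hypergradient descent (SGD-HD).

    For the SGD update theta_t = theta_{t-1} - alpha * g_{t-1}, the
    hypergradient d f(theta_t)/d alpha equals -g_t . g_{t-1}, so a
    descent step on alpha adds beta * g_t . g_{t-1}.
    """
    g_prev = None
    for _ in range(steps):
        g = grad_fn(theta)
        if g_prev is not None:
            # grow alpha when consecutive gradients agree, shrink it otherwise
            alpha = alpha + beta * float(g @ g_prev)
        theta = theta - alpha * g
        g_prev = g
    return theta, alpha

# Toy usage on f(theta) = 0.5 * ||theta||^2, whose gradient is theta
theta, alpha = sgd_hd(lambda th: th, theta=np.ones(5))
```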

Citations

Adapting the Learning Rate of the Learning Rate in Hypergradient Descent

  • Kazuma ItakuraKyohei AtarashiS. OyamaM. Kurihara
  • Computer Science, Education
    2020 Joint 11th International Conference on Soft Computing and Intelligent Systems and 21st International Symposium on Advanced Intelligent Systems (SCIS-ISIS)
  • 2020
This work investigated two datasets and two optimization methods and achieved an effective adjustment of the learning rate when the objective function was convex and $L$-smooth.

Meta-Regularization: An Approach to Adaptive Choice of the Learning Rate in Gradient Descent

Meta-Regularization modifies the objective function by adding a regularization term on the learning rate, and casts the joint updating process of parameters and learning rates into a max-min problem, which facilitates the generation of practical algorithms.

Learning the Learning Rate for Gradient Descent by Gradient Descent

This paper introduces an algorithm inspired by the work of Franceschi et al. (2017) for automatically tuning the learning rate while training neural networks, and presents a comparison between RT-HPO and other popular HPO techniques, showing that the approach performs better in terms of the final accuracy of the trained model.

Adaptive Learning Rate and Momentum for Training Deep Neural Networks

A fast training method motivated by the nonlinear Conjugate Gradient with Quadratic line-search (CGQ) framework, which yields faster convergence than other local solvers and better generalization capability (test-set accuracy).
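
As a rough illustration of the quadratic line-search ingredient only (not the cited CGQ algorithm), one can evaluate the loss at three trial step sizes along the search direction, fit a parabola, and jump to its vertex. Everything in the sketch below, including the fallback rule and the toy objective, is an assumption of this illustration.

```python
import numpy as np

def quadratic_line_search(loss_fn, theta, direction, t_max=1.0):
    """Illustrative quadratic line search (not the cited CGQ method).

    Fits a parabola through the loss at t = 0, t_max/2, t_max along
    `direction` and returns the step size at the parabola's vertex,
    clipped to [0, t_max].
    """
    ts = np.array([0.0, 0.5 * t_max, t_max])
    losses = np.array([loss_fn(theta + t * direction) for t in ts])
    a, b, _ = np.polyfit(ts, losses, 2)  # exact fit through 3 points
    if a <= 0:  # not convex along this line: fall back to best sampled t
        return float(ts[np.argmin(losses)])
    return float(np.clip(-b / (2 * a), 0.0, t_max))

# Usage: step along the negative gradient of f(x) = 0.5 * ||x||^2
theta = np.array([2.0, -1.0])
d = -theta
theta = theta + quadratic_line_search(lambda x: 0.5 * x @ x, theta, d) * d
```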

Differentiable Self-Adaptive Learning Rate

A novel adaptation algorithm is proposed, in which the learning rate is parameter-specific and internally structured, and which can achieve faster and higher-quality convergence than state-of-the-art optimizers.

Gradient Descent: The Ultimate Optimizer

This work proposes to instead learn the hyperparameters themselves by gradient descent, and furthermore to learn the hyper-hyperparameters by gradient descent as well, and so on ad infinitum.
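
Continuing the local approximation used for SGD-HD above, the hypergradient step size can itself be adapted by one more application of the chain rule; the two-level sketch below is our own illustration of the "hyper-hyperparameter" idea under that approximation, not the cited paper's algorithm, and all constants are arbitrary.

```python
import numpy as np

def sgd_hd_stacked(grad_fn, theta, alpha=1e-3, beta=1e-6, gamma=1e-9, steps=100):
    """Illustrative two-level hypergradient SGD.

    Level 1: alpha follows the hypergradient signal h_t = g_t . g_{t-1}.
    Level 2: beta follows h_t * h_{t-1}, obtained by applying the same
    local chain rule once more (ignoring indirect dependencies).
    """
    g_prev, h_prev = None, None
    for _ in range(steps):
        g = grad_fn(theta)
        if g_prev is not None:
            h = float(g @ g_prev)
            if h_prev is not None:
                beta = beta + gamma * h * h_prev   # adapt the adapter
            alpha = alpha + beta * h               # adapt the learning rate
            h_prev = h
        theta = theta - alpha * g
        g_prev = g
    return theta, alpha, beta
```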

Using Statistics to Automate Stochastic Optimization

This work designs an explicit statistical test that determines when the dynamics of stochastic gradient descent reach a stationary distribution and proposes an approach that automates the most common hand-tuning heuristic: use a constant learning rate until "progress stops," then drop.
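
The cited work designs its own statistical test; a much simpler classical diagnostic in the same spirit (often attributed to Pflug) watches a running average of inner products of consecutive stochastic gradients, which tends to turn negative once the iterates bounce around a stationary distribution. The sketch below implements only that generic "constant until progress stops, then drop" heuristic, with window size and drop factor chosen arbitrarily.

```python
import numpy as np

def constant_then_drop(grad_fn, theta, alpha=0.1, drop=0.1,
                       window=200, steps=5000):
    """Generic constant-then-drop schedule with a Pflug-style diagnostic.

    Progress is declared stopped when the running mean of <g_t, g_{t-1}>
    over `window` steps is negative; this is an illustration, not the
    cited paper's test.
    """
    g_prev, history = None, []
    for _ in range(steps):
        g = grad_fn(theta)
        if g_prev is not None:
            history.append(float(g @ g_prev))
            if len(history) >= window and np.mean(history[-window:]) < 0:
                alpha *= drop      # drop the learning rate
                history.clear()    # restart the diagnostic
        theta = theta - alpha * g
        g_prev = g
    return theta, alpha
```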

Step-size Adaptation Using Exponentiated Gradient Updates

This paper updates the step-size scale and the gain variables with exponentiated gradient updates instead and shows that this approach can achieve compelling accuracy on standard models without using any specially tuned learning rate schedule.
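
A generic way to realize multiplicative (exponentiated-gradient) step-size updates is to scale the step size by the exponential of an adaptation signal, which keeps it positive by construction. The signal and constants in the sketch below (a normalized gradient-agreement term, as in the SGD-HD sketch above) are assumptions of this illustration rather than the cited paper's scheme.

```python
import numpy as np

def sgd_eg_stepsize(grad_fn, theta, alpha=1e-2, beta=1e-2, steps=100):
    """SGD with an exponentiated-gradient update of the step size.

    Instead of the additive rule alpha += beta * h, the step size is
    scaled multiplicatively, alpha *= exp(beta * h), so it stays positive.
    """
    g_prev = None
    for _ in range(steps):
        g = grad_fn(theta)
        if g_prev is not None:
            # normalized agreement between consecutive gradients, in [-1, 1]
            h = float(g @ g_prev) / (np.linalg.norm(g) * np.linalg.norm(g_prev) + 1e-12)
            alpha = alpha * np.exp(beta * h)
        theta = theta - alpha * g
        g_prev = g
    return theta, alpha
```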

Learning with Random Learning Rates

Alrao could save time when testing deep learning models: a range of models could be quickly assessed with Alrao, and the most promising models could then be trained more extensively.

Statistical Adaptive Stochastic Gradient Methods

A statistical adaptive procedure called SALSA is proposed for automatically scheduling the learning rate (step size) in stochastic gradient methods, based on a new statistical test for detecting stationarity when using a constant step size.
...

References

SHOWING 1-10 OF 31 REFERENCES

No more pesky learning rates

The proposed method to automatically adjust multiple learning rates so as to minimize the expected error at any one time relies on local gradient variations across samples, making it suitable for non-stationary problems.

Learning Rate Adaptation in Stochastic Gradient Descent

The main feature of the proposed learning rate adaptation scheme is that it exploits gradient-related information from the current as well as the two previous pattern presentations. This provides some stabilization in the value of the learning rate and helps stochastic gradient descent exhibit fast convergence and a high rate of success.

Adaptive Subgradient Methods for Online Learning and Stochastic Optimization

This work describes and analyzes an apparatus for adaptively modifying the proximal function, which significantly simplifies setting a learning rate and results in regret guarantees that are provably as good as the best proximal function that can be chosen in hindsight.
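
The core diagonal AdaGrad update accumulates squared gradients per coordinate and divides the step by their square root, so frequently and strongly updated coordinates receive smaller effective learning rates. A minimal sketch (the function name and defaults are ours):

```python
import numpy as np

def adagrad(grad_fn, theta, alpha=0.1, eps=1e-8, steps=100):
    """Diagonal AdaGrad: per-coordinate steps shrink with the
    accumulated squared gradient of that coordinate."""
    accum = np.zeros_like(theta)
    for _ in range(steps):
        g = grad_fn(theta)
        accum += g ** 2
        theta = theta - alpha * g / (np.sqrt(accum) + eps)
    return theta
```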

Local Gain Adaptation in Stochastic Gradient Descent

The limitations of this approach are discussed, and an alternative is developed by extending Sutton's work on linear systems to the general, nonlinear case, and the resulting online algorithms are computationally little more expensive than other acceleration techniques, and do not assume statistical independence between successive training patterns.

Gradient-Based Optimization of Hyperparameters

This article presents a methodology to optimize several hyper-parameters, based on the computation of the gradient of a model selection criterion with respect to the hyper-parameters; this gradient involves second derivatives of the training criterion.
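
In its standard form, such a hyperparameter gradient is obtained by implicit differentiation of the trained parameters, which is where the second derivatives of the training criterion enter. The notation below is ours (validation criterion C_val, training criterion C_tr, hyperparameter lambda, trained parameters theta-hat), written under the assumption that theta-hat(lambda) is a stationary point of the training criterion:

```latex
\frac{dC_{\mathrm{val}}}{d\lambda}
  = \frac{\partial C_{\mathrm{val}}}{\partial \lambda}
  + \frac{\partial C_{\mathrm{val}}}{\partial \theta}\bigg|_{\hat\theta(\lambda)}
    \cdot \frac{d\hat\theta}{d\lambda},
\qquad
\frac{d\hat\theta}{d\lambda}
  = -\left(\frac{\partial^{2} C_{\mathrm{tr}}}{\partial \theta\, \partial \theta^{\top}}\right)^{-1}
     \frac{\partial^{2} C_{\mathrm{tr}}}{\partial \theta\, \partial \lambda},
```

which follows from differentiating the stationarity condition \(\partial C_{\mathrm{tr}}/\partial\theta = 0\) with respect to \(\lambda\).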

Gradient-based Hyperparameter Optimization through Reversible Learning

This work computes exact gradients of cross-validation performance with respect to all hyperparameters by chaining derivatives backwards through the entire training procedure, which allows us to optimize thousands of hyperparameters, including step-size and momentum schedules, weight initialization distributions, richly parameterized regularization schemes, and neural network architectures.
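
The cited approach reverses an entire training run to obtain exact hypergradients; the underlying idea, chaining derivatives through every update, can be shown in forward mode for a single hyperparameter on a toy quadratic. The sketch below (problem, names, and constants are ours) computes the exact derivative of the final loss with respect to the learning rate.

```python
import numpy as np

def loss_and_hypergrad(A, theta0, alpha, steps=20):
    """Exact d(final loss)/d(alpha) by chaining derivatives through
    every SGD update (forward mode; the cited work uses reverse mode).

    Loss: f(theta) = 0.5 * theta^T A theta, with gradient A @ theta.
    """
    theta = theta0.copy()
    dtheta = np.zeros_like(theta0)        # d theta_t / d alpha
    for _ in range(steps):
        g = A @ theta
        # theta_{t+1} = theta_t - alpha * g, so
        # d theta_{t+1}/d alpha = d theta_t/d alpha - g - alpha * A @ (d theta_t/d alpha)
        dtheta = dtheta - g - alpha * (A @ dtheta)
        theta = theta - alpha * g
    loss = 0.5 * float(theta @ A @ theta)
    dloss_dalpha = float((A @ theta) @ dtheta)
    return loss, dloss_dalpha

A = np.diag([1.0, 10.0])
print(loss_and_hypergrad(A, np.array([1.0, 1.0]), alpha=0.05))
```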

Increased rates of convergence through learning rate adaptation

Adam: A Method for Stochastic Optimization

This work introduces Adam, an algorithm for first-order gradient-based optimization of stochastic objective functions, based on adaptive estimates of lower-order moments, and provides a regret bound on the convergence rate that is comparable to the best known results under the online convex optimization framework.
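
For reference, the Adam update keeps exponential moving averages of the gradient and of its elementwise square, corrects their initialization bias, and scales the step elementwise; a minimal sketch with the commonly quoted default constants:

```python
import numpy as np

def adam(grad_fn, theta, alpha=1e-3, beta1=0.9, beta2=0.999,
         eps=1e-8, steps=100):
    """Minimal Adam: bias-corrected first and second moment estimates."""
    m = np.zeros_like(theta)
    v = np.zeros_like(theta)
    for t in range(1, steps + 1):
        g = grad_fn(theta)
        m = beta1 * m + (1 - beta1) * g          # first moment estimate
        v = beta2 * v + (1 - beta2) * g ** 2     # second raw moment estimate
        m_hat = m / (1 - beta1 ** t)             # bias corrections
        v_hat = v / (1 - beta2 ** t)
        theta = theta - alpha * m_hat / (np.sqrt(v_hat) + eps)
    return theta
```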

A direct adaptive method for faster backpropagation learning: the RPROP algorithm

A learning algorithm for multilayer feedforward networks, RPROP (resilient propagation), is proposed that performs a local adaptation of the weight-updates according to the behavior of the error function to overcome the inherent disadvantages of pure gradient-descent.
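
RPROP-style local adaptation uses only the sign of each partial derivative: a per-weight step size grows while the sign persists and shrinks when it flips. The sketch below is a simplified version that omits the backtracking step of the full algorithm; the increase/decrease factors and step bounds are the commonly quoted defaults, used here only for illustration.

```python
import numpy as np

def rprop(grad_fn, theta, step_init=0.1, inc=1.2, dec=0.5,
          step_min=1e-6, step_max=50.0, steps=100):
    """Simplified sign-based RPROP-style update with per-weight step sizes."""
    step = np.full_like(theta, step_init)
    g_prev = np.zeros_like(theta)
    for _ in range(steps):
        g = grad_fn(theta)
        agreement = g * g_prev
        step = np.where(agreement > 0, np.minimum(step * inc, step_max), step)
        step = np.where(agreement < 0, np.maximum(step * dec, step_min), step)
        theta = theta - np.sign(g) * step   # move by the step size, sign only
        g_prev = g
    return theta
```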

Practical Recommendations for Gradient-Based Training of Deep Architectures

  • Yoshua Bengio
  • Computer Science
    Neural Networks: Tricks of the Trade
  • 2012
Overall, this chapter describes elements of the practice used to successfully and efficiently train and debug large-scale and often deep multi-layer neural networks and closes with open questions about the training difficulties observed with deeper architectures.