Corpus ID: 3508234

Online Learning Rate Adaptation with Hypergradient Descent

@article{Baydin2018OnlineLR,
  title={Online Learning Rate Adaptation with Hypergradient Descent},
  author={Atilim Gunes Baydin and Robert Cornish and David Mart{\'i}nez-Rubio and Mark W. Schmidt and Frank D. Wood},
  journal={ArXiv},
  year={2018},
  volume={abs/1703.04782}
}
We introduce a general method for improving the convergence rate of gradient-based optimizers that is easy to implement and works well in practice. We demonstrate the effectiveness of the method in a range of optimization problems by applying it to stochastic gradient descent, stochastic gradient descent with Nesterov momentum, and Adam, showing that it significantly reduces the need for the manual tuning of the initial learning rate for these commonly used algorithms. Our method works by…
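The truncated abstract describes adapting the learning rate online by taking a gradient step on the learning rate itself (the hypergradient). As a rough illustration of the idea, the sketch below applies the additive hypergradient update to plain SGD on a toy quadratic; the objective, the initial learning rate, and the hypergradient step size beta are illustrative choices, not values from the paper.

```python
# Minimal sketch of hypergradient-descent SGD (SGD-HD) on a toy quadratic.
# Assumptions (not taken from the abstract above): numpy-only, a hand-written
# quadratic objective, and illustrative values for alpha_0 and beta.
import numpy as np

def grad_f(theta):
    # Gradient of the toy objective f(theta) = 0.5 * ||theta||^2.
    return theta

theta = np.array([1.0, -2.0])   # initial parameters
alpha = 0.01                    # initial learning rate alpha_0
beta = 1e-4                     # hypergradient step size (meta learning rate)
prev_grad = np.zeros_like(theta)

for t in range(100):
    g = grad_f(theta)
    # Hypergradient step: the derivative of the loss w.r.t. alpha is
    # -g_t . g_{t-1}, so a gradient step on alpha adds beta * (g_t . g_{t-1}).
    alpha = alpha + beta * np.dot(g, prev_grad)
    # Ordinary SGD step with the freshly adapted learning rate.
    theta = theta - alpha * g
    prev_grad = g

print(theta, alpha)
```

The same wrapping can in principle be applied to other base optimizers (the abstract mentions SGD with Nesterov momentum and Adam); only the expression for the derivative of the update with respect to the learning rate changes.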
Adapting the Learning Rate of the Learning Rate in Hypergradient Descent
TLDR: This work investigated the use of two datasets and two optimization methods for adjusting the learning rate when the objective function is convex and $L$-smooth, and achieved an effective adjustment of the learning rate.
Learning the Learning Rate for Gradient Descent by Gradient Descent
This paper introduces an algorithm inspired by the work of Franceschi et al. (2017) for automatically tuning the learning rate while training neural networks. We formalize this problem as…
Adaptive Learning Rate and Momentum for Training Deep Neural Networks
TLDR: A fast training method motivated by the nonlinear Conjugate Gradient with Quadratic line-search (CGQ) framework that yields faster convergence than other local solvers and has better generalization capability (test set accuracy).
Gradient Descent: The Ultimate Optimizer
TLDR: This work proposes to instead learn the hyperparameters themselves by gradient descent, and furthermore to learn the hyper-hyperparameters by gradient descent as well, and so on ad infinitum.
Using Statistics to Automate Stochastic Optimization
TLDR: This work designs an explicit statistical test that determines when the dynamics of stochastic gradient descent reach a stationary distribution, and proposes an approach that automates the most common hand-tuning heuristic: use a constant learning rate until "progress stops," then drop it.
Step-size Adaptation Using Exponentiated Gradient Updates
Optimizers like Adam and AdaGrad have been very successful in training large-scale neural networks. Yet, the performance of these methods is heavily dependent on a carefully tuned learning rate…
Learning with Random Learning Rates
TLDR: Alrao could save time when testing deep learning models: a range of models could be quickly assessed with Alrao, and the most promising models could then be trained more extensively.
Statistical Adaptive Stochastic Gradient Methods
TLDR: A statistical adaptive procedure called SALSA for automatically scheduling the learning rate (step size) in stochastic gradient methods, based on a new statistical test for detecting stationarity when using a constant step size.
Scheduling the Learning Rate via Hypergradients: New Insights and a New Algorithm
TLDR: This work describes the structure of the gradient of a validation error w.r.t. the learning rate, the hypergradient, and based on this it introduces a novel online algorithm that adaptively interpolates between the recently proposed techniques of Franceschi et al. (2017) and Baydin (2017), featuring increased stability and faster convergence.
AdaS: Adaptive Scheduling of Stochastic Gradients
TLDR: This work attempts to answer a question of interest to both researchers and practitioners, namely how much knowledge is gained in iterative training of deep neural networks, and proposes a new algorithm called Adaptive Scheduling (AdaS) that utilizes the derived metrics to adapt the SGD learning rate proportionally to the rate of change in knowledge gain over successive iterations.

References

Showing 1-10 of 33 references
No more pesky learning rates
TLDR: The proposed method to automatically adjust multiple learning rates so as to minimize the expected error at any one time relies on local gradient variations across samples, making it suitable for non-stationary problems.
Learning Rate Adaptation in Stochastic Gradient Descent
TLDR: The main feature of the proposed learning rate adaptation scheme is that it exploits gradient-related information from the current as well as the two previous pattern presentations, which stabilizes the value of the learning rate and helps stochastic gradient descent to exhibit fast convergence and a high rate of success.
Adaptive Subgradient Methods for Online Learning and Stochastic Optimization
TLDR: This work describes and analyzes an apparatus for adaptively modifying the proximal function, which significantly simplifies setting a learning rate and results in regret guarantees that are provably as good as the best proximal function that can be chosen in hindsight.
Local Gain Adaptation in Stochastic Gradient Descent
TLDR: The limitations of this approach are discussed, and an alternative is developed by extending Sutton's work on linear systems to the general, nonlinear case; the resulting online algorithms are computationally little more expensive than other acceleration techniques and do not assume statistical independence between successive training patterns.
Gradient-Based Optimization of Hyperparameters
TLDR: This article presents a methodology to optimize several hyper-parameters, based on the computation of the gradient of a model selection criterion with respect to the hyper-parameters, with the hyper-parameter gradient involving second derivatives of the training criterion.
Gradient-based Hyperparameter Optimization through Reversible Learning
TLDR: This work computes exact gradients of cross-validation performance with respect to all hyperparameters by chaining derivatives backwards through the entire training procedure, which allows us to optimize thousands of hyperparameters, including step-size and momentum schedules, weight initialization distributions, richly parameterized regularization schemes, and neural network architectures.
Increased rates of convergence through learning rate adaptation
TLDR: A study of Steepest Descent and an analysis of why it can be slow to converge are presented, and four heuristics for achieving faster rates of convergence are proposed.
Adam: A Method for Stochastic Optimization
TLDR: This work introduces Adam, an algorithm for first-order gradient-based optimization of stochastic objective functions, based on adaptive estimates of lower-order moments, and provides a regret bound on the convergence rate that is comparable to the best known results under the online convex optimization framework.
Convergence Analysis of an Adaptive Method of Gradient Descent
This dissertation studies the convergence of an adaptive method of gradient descent called Hypergradient Descent. We review some methods of gradient descent and their proofs of convergence for smooth…
A direct adaptive method for faster backpropagation learning: the RPROP algorithm
TLDR: A learning algorithm for multilayer feedforward networks, RPROP (resilient propagation), is proposed that performs a local adaptation of the weight updates according to the behavior of the error function, to overcome the inherent disadvantages of pure gradient descent.