Corpus ID: 198166077

Neural Optimizers with Hypergradients for Tuning Parameter-Wise Learning Rates

  title={Neural Optimizers with Hypergradients for Tuning Parameter-Wise Learning Rates},
  author={Jie Fu and Ritchie Ng and Danlu Chen and Ilija Ilievski and Tat-Seng Chua},
Recent studies show that LSTM-based neural optimizers are competitive with state-of-the-art hand-designed optimization methods for short horizons. Existing neural optimizers learn how to update the optimizee parameters, i.e., they directly predict the product of learning rates and gradients, and we suspect this is why the training task becomes unnecessarily difficult. Instead, we train a neural optimizer to only control the learning rates of another optimizer using gradients of the…
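The distinction the abstract draws can be sketched on a toy quadratic (our own illustration; the fixed maps below merely stand in for the learned LSTM, whose architecture the source does not specify):

```python
import numpy as np

rng = np.random.default_rng(0)
theta = rng.normal(size=5)  # optimizee parameters for f(theta) = 0.5 * ||theta||^2

def grad(t):
    return t  # gradient of the toy quadratic

# (a) Existing neural optimizers: a learned map predicts the update itself,
#     i.e. the product of learning rate and gradient.
def predict_update(g):
    return -0.1 * g  # placeholder for a learned network's output

# (b) The proposed factorization: the learned map predicts only per-parameter
#     learning rates, which then multiply the true gradient.
def predict_lr(g):
    return np.full_like(g, 0.1)  # placeholder for a learned network's output

theta_a, theta_b = theta.copy(), theta.copy()
for _ in range(50):
    theta_a = theta_a + predict_update(grad(theta_a))              # style (a)
    theta_b = theta_b - predict_lr(grad(theta_b)) * grad(theta_b)  # style (b)
```

Both toy parameterizations converge here; the abstract's claim is that learning only the rates, as in (b), makes the meta-training task easier.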


Adaptive Multi-level Hyper-gradient Descent
The experiments on several network architectures including feed-forward networks, LeNet-5 and ResNet-34 show that the proposed multi-level adaptive approach can outperform baseline adaptive methods in a variety of circumstances with statistical significance.
Scalable One-Pass Optimisation of High-Dimensional Weight-Update Hyperparameters by Implicit Differentiation
This work extends existing methods to develop an approximate hypergradient-based hyperparameter optimiser which is applicable to any continuous hyperparameter appearing in a differentiable model weight update, yet requires only one training episode, with no restarts.


Learned Optimizers that Scale and Generalize
This work introduces a learned gradient descent optimizer that generalizes well to new tasks, and which has significantly reduced memory and computation overhead, by introducing a novel hierarchical RNN architecture with minimal per-parameter overhead.
Online Batch Selection for Faster Training of Neural Networks
This work investigates online batch selection strategies for two state-of-the-art methods of stochastic gradient-based optimization, AdaDelta and Adam, and proposes a simple strategy where all datapoints are ranked w.r.t. their latest known loss value and the probability to be selected decays exponentially as a function of rank.
Learning Gradient Descent: Better Generalization and Longer Horizons
This paper proposes a new learning-to-learn model and some useful and practical tricks, and demonstrates the effectiveness of the algorithms on a number of tasks, including deep MLPs, CNNs, and simple LSTMs.
The Marginal Value of Adaptive Gradient Methods in Machine Learning
It is observed that the solutions found by adaptive methods generalize worse (often significantly worse) than SGD, even when these solutions have better training performance, suggesting that practitioners should reconsider the use of adaptive methods to train neural networks.
Online Learning Rate Adaptation with Hypergradient Descent
We introduce a general method for improving the convergence rate of gradient-based optimizers that is easy to implement and works well in practice. We demonstrate the effectiveness of the method in a…
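The underlying update, hypergradient descent, takes a gradient step on the loss with respect to the learning rate itself; a minimal sketch on a toy quadratic (constants and names are ours):

```python
import numpy as np

def f_grad(theta):
    return theta  # gradient of f(theta) = 0.5 * ||theta||^2

theta = np.array([5.0, -3.0])
alpha = 0.01                      # initial learning rate
beta = 0.001                      # hypergradient step size
prev_grad = np.zeros_like(theta)

for _ in range(100):
    g = f_grad(theta)
    # For plain SGD, d f(theta_t) / d alpha = -g_t . g_{t-1}, so gradient
    # descent on alpha becomes this dot-product update:
    alpha += beta * (g @ prev_grad)
    theta -= alpha * g
    prev_grad = g
```

When consecutive gradients point the same way, alpha grows; when they oscillate, it shrinks.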
Adam: A Method for Stochastic Optimization
This work introduces Adam, an algorithm for first-order gradient-based optimization of stochastic objective functions, based on adaptive estimates of lower-order moments, and provides a regret bound on the convergence rate that is comparable to the best known results under the online convex optimization framework.
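For reference, the Adam update itself is short; a self-contained sketch using the paper's default hyperparameters:

```python
import numpy as np

def adam_step(theta, g, m, v, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    """One Adam step: EMAs of the gradient and squared gradient, bias-corrected."""
    m = b1 * m + (1 - b1) * g          # first-moment (mean) estimate
    v = b2 * v + (1 - b2) * g * g      # second-moment (uncentered variance) estimate
    m_hat = m / (1 - b1 ** t)          # bias correction (t starts at 1)
    v_hat = v / (1 - b2 ** t)
    return theta - lr * m_hat / (np.sqrt(v_hat) + eps), m, v

# Minimize the toy quadratic f(theta) = 0.5 * ||theta||^2, whose gradient is theta.
theta = np.array([1.0, -2.0])
m = np.zeros_like(theta)
v = np.zeros_like(theta)
for t in range(1, 5001):
    theta, m, v = adam_step(theta, theta, m, v, t)
```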
No more pesky learning rates
The proposed method to automatically adjust multiple learning rates so as to minimize the expected error at any one time relies on local gradient variations across samples, making it suitable for non-stationary problems.
Learning to learn by gradient descent by gradient descent
This paper shows how the design of an optimization algorithm can be cast as a learning problem, allowing the algorithm to learn to exploit structure in the problems of interest in an automatic way.
Gradient-based Hyperparameter Optimization through Reversible Learning
This work computes exact gradients of cross-validation performance with respect to all hyperparameters by chaining derivatives backwards through the entire training procedure, which allows us to optimize thousands of hyperparameters, including step-size and momentum schedules, weight initialization distributions, richly parameterized regularization schemes, and neural network architectures.
SGDR: Stochastic Gradient Descent with Restarts
This paper proposes a simple restart technique for stochastic gradient descent to improve its anytime performance when training deep neural networks and empirically studies its performance on the CIFAR-10 and CIFAR-100 datasets.
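The restart technique is usually paired with cosine annealing within each cycle; a minimal sketch of such a schedule (parameter names are ours):

```python
import math

def sgdr_lr(step, eta_min=0.0, eta_max=0.1, t0=10, t_mult=2):
    """Cosine-annealed learning rate with warm restarts.

    The first cycle lasts t0 steps; each restart multiplies the cycle
    length by t_mult and jumps the rate back up to eta_max."""
    t_i, t_cur = t0, step
    while t_cur >= t_i:          # locate the cycle containing this step
        t_cur -= t_i
        t_i *= t_mult
    return eta_min + 0.5 * (eta_max - eta_min) * (1 + math.cos(math.pi * t_cur / t_i))
```

`sgdr_lr(0)` returns `eta_max`; the rate decays toward `eta_min` over a cycle and resets at each restart.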