• Corpus ID: 204788825

First-Order Preconditioning via Hypergradient Descent

Theodore H. Moskovitz, Rui Wang, Janice Lan, Sanyam Kapoor, Thomas Miconi, Jason Yosinski, Aditya Rawal
Standard gradient descent methods are susceptible to a range of issues that can impede training, such as high correlations and different scaling in parameter space. These difficulties can be addressed by second-order approaches that apply a preconditioning matrix to the gradient to improve convergence. Unfortunately, such algorithms typically struggle to scale to high-dimensional problems, in part because the calculation of specific preconditioners such as the inverse Hessian or Fisher…
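To illustrate what applying a preconditioning matrix to the gradient means, here is a minimal NumPy sketch. It is not the method proposed in the paper. The quadratic objective, the matrix `A`, the step sizes, and the helper names `grad` and `descend` are all assumptions chosen for illustration. The sketch compares plain gradient descent with descent preconditioned by the exact inverse Hessian on a small problem whose coordinates are correlated and differently scaled.

```python
import numpy as np

# Hypothetical toy objective: f(x) = 0.5 * x^T A x.
# A is symmetric positive definite with correlated, badly scaled coordinates.
A = np.array([[100.0, 9.0],
              [9.0, 1.0]])


def grad(x):
    """Gradient of the quadratic objective; its Hessian is A everywhere."""
    return A @ x


def descend(P, lr, steps=50):
    """Run the preconditioned update x <- x - lr * P @ grad(x)."""
    x = np.array([1.0, 1.0])
    for _ in range(steps):
        x = x - lr * (P @ grad(x))
    return x


# Identity preconditioner: ordinary gradient descent. The step size must stay
# below 2 / (largest eigenvalue of A) for the iteration to remain stable.
plain = descend(np.eye(2), lr=0.015)

# Inverse-Hessian preconditioner: a Newton step for this quadratic.
precond = descend(np.linalg.inv(A), lr=1.0)

print("plain GD distance to optimum:        ", np.linalg.norm(plain))
print("preconditioned distance to optimum:  ", np.linalg.norm(precond))
```

On this toy problem the preconditioned iterate reaches the minimizer at the origin while plain gradient descent is still far from it after the same number of steps. Forming and inverting `A` explicitly is only feasible here because the problem is two-dimensional, which is the scaling difficulty the abstract points to.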
2 Citations
A Generalizable Approach to Learning Optimizers
This work describes a system designed from a generalization-first perspective: it learns to update optimizer hyperparameters, rather than model parameters directly, using novel features, actions, and a reward function. The system outperforms Adam at all neural network tasks, including on modalities not seen during training.
Organizing recurrent network dynamics by task-computation to enable continual learning
A novel learning rule is developed to minimize interference between sequentially learned tasks in recurrent networks, and networks trained using this approach are shown to reuse similar dynamical structures across similar tasks.


Preconditioned Stochastic Gradient Descent
  • Xi-Lin Li
  • Mathematics, Computer Science
    IEEE Transactions on Neural Networks and Learning Systems
  • 2018
Experimental results demonstrate that, equipped with the new preconditioner and without any tuning effort, preconditioned SGD can efficiently solve many challenging problems, such as training a deep neural network or a recurrent neural network requiring extremely long-term memories.
A Kronecker-factored approximate Fisher matrix for convolution layers
Second-order optimization methods such as natural gradient descent have the potential to speed up training of neural networks by correcting for the curvature of the loss function. Unfortunately, the
Adam: A Method for Stochastic Optimization
This work introduces Adam, an algorithm for first-order gradient-based optimization of stochastic objective functions, based on adaptive estimates of lower-order moments, and provides a regret bound on the convergence rate that is comparable to the best known results under the online convex optimization framework.
Adaptive Subgradient Methods for Online Learning and Stochastic Optimization
This work describes and analyzes an apparatus for adaptively modifying the proximal function, which significantly simplifies setting a learning rate and results in regret guarantees that are provably as good as the best proximal functions that can be chosen in hindsight.
Optimizing Neural Networks with Kronecker-factored Approximate Curvature
K-FAC is an efficient method for approximating natural gradient descent in neural networks. It is based on an efficiently invertible approximation of a neural network's Fisher information matrix that is neither diagonal nor low-rank, and in some cases is completely non-sparse.
Optimization Methods for Large-Scale Machine Learning
A major theme of this study is that large-scale machine learning represents a distinctive setting in which the stochastic gradient method has traditionally played a central role while conventional gradient-based nonlinear optimization techniques typically falter, leading to a discussion about the next generation of optimization methods for large-scale machine learning.
Linear Convergence of Gradient and Proximal-Gradient Methods Under the Polyak-Łojasiewicz Condition
This work shows that the much older Polyak-Łojasiewicz (PL) inequality is actually weaker than the main conditions that have been explored to show linear convergence rates without strong convexity over the last 25 years, leading to simple proofs of linear convergence of these methods.
Online Learning Rate Adaptation with Hypergradient Descent
We introduce a general method for improving the convergence rate of gradient-based optimizers that is easy to implement and works well in practice. We demonstrate the effectiveness of the method in a
Gradient-based Hyperparameter Optimization through Reversible Learning
This work computes exact gradients of cross-validation performance with respect to all hyperparameters by chaining derivatives backwards through the entire training procedure, which allows us to optimize thousands of hyperparameters, including step-size and momentum schedules, weight initialization distributions, richly parameterized regularization schemes, and neural network architectures.
Natural Neural Networks
A specific example that employs a simple and efficient reparametrization of the neural network weights by implicitly whitening the representation obtained at each layer, while preserving the feed-forward computation of the network.