# First-Order Preconditioning via Hypergradient Descent

```bibtex
@article{Moskovitz2019FirstOrderPV,
  title   = {First-Order Preconditioning via Hypergradient Descent},
  author  = {Theodore H. Moskovitz and Rui Wang and Janice Lan and Sanyam Kapoor and Thomas Miconi and Jason Yosinski and Aditya Rawal},
  journal = {ArXiv},
  year    = {2019},
  volume  = {abs/1910.08461}
}
```

Standard gradient descent methods are susceptible to a range of issues that can impede training, such as high correlations and different scaling in parameter space. These difficulties can be addressed by second-order approaches that apply a preconditioning matrix to the gradient to improve convergence. Unfortunately, such algorithms typically struggle to scale to high-dimensional problems, in part because the calculation of specific preconditioners such as the inverse Hessian or Fisher…
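To make the abstract concrete, the following is a minimal sketch (not the paper's exact algorithm) of the general idea it describes: a preconditioner applied to the gradient is itself tuned by hypergradient descent on the loss. The badly scaled quadratic objective, the diagonal (rather than full-matrix) preconditioner, and all hyperparameter values are illustrative assumptions.

```python
import numpy as np

def loss(w, scales):
    return 0.5 * np.sum(scales * w ** 2)

def grad(w, scales):
    return scales * w

scales = np.array([1.0, 100.0])   # very different curvature per dimension
w = np.array([1.0, 1.0])
d = np.ones_like(w)               # diagonal preconditioner, starts at identity
lr, hyper_lr = 0.005, 1e-6
g_prev = np.zeros_like(w)

for _ in range(2000):
    g = grad(w, scales)
    # Hypergradient of the loss w.r.t. d is -lr * g * g_prev (chain rule
    # through the previous update); descending it means d += const * g * g_prev,
    # with the lr factor folded into hyper_lr.
    d += hyper_lr * g * g_prev
    w -= lr * d * g               # preconditioned gradient step
    g_prev = g

assert loss(w, scales) < 1e-3     # converges despite the 100x scale mismatch
```

The point of the sketch is only the structure of the update: the same gradients already computed for the parameter step are reused to adapt the preconditioner, so the extra cost is first-order.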

## 2 Citations

A Generalizable Approach to Learning Optimizers

- Computer Science, Mathematics · ArXiv
- 2021

This work describes a system designed from a generalization-first perspective that learns to update optimizer hyperparameters rather than model parameters directly, using novel features, actions, and a reward function; the learned optimizer outperforms Adam on all tested neural network tasks, including modalities not seen during training.

Organizing recurrent network dynamics by task-computation to enable continual learning

- Computer Science · NeurIPS
- 2020

A novel learning rule is developed to minimize interference between sequentially learned tasks in recurrent networks, and networks trained with this approach are shown to reuse similar dynamical structures across similar tasks.

## References

Showing 1-10 of 37 references

Preconditioned Stochastic Gradient Descent

- Mathematics, Computer Science · IEEE Transactions on Neural Networks and Learning Systems
- 2018

Experimental results demonstrate that, equipped with the new preconditioner and without any tuning effort, preconditioned SGD can efficiently solve many challenging problems, such as training a deep neural network or a recurrent neural network requiring extremely long-term memory.

A Kronecker-factored approximate Fisher matrix for convolution layers

- Mathematics, Computer Science · ICML
- 2016

Second-order optimization methods such as natural gradient descent have the potential to speed up training of neural networks by correcting for the curvature of the loss function. Unfortunately, the…

Adam: A Method for Stochastic Optimization

- Computer Science, Mathematics · ICLR
- 2015

This work introduces Adam, an algorithm for first-order gradient-based optimization of stochastic objective functions, based on adaptive estimates of lower-order moments, and provides a regret bound on the convergence rate that is comparable to the best known results under the online convex optimization framework.
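As a reminder of the method this entry summarizes, the Adam update can be sketched as follows. This is a minimal NumPy rendering of the published update rule; the toy quadratic objective and the hyperparameter values used here are illustrative choices, not taken from the paper's experiments.

```python
import numpy as np

def adam_step(w, g, m, v, t, lr=0.01, b1=0.9, b2=0.999, eps=1e-8):
    """One Adam update: adaptive per-parameter steps from moment estimates."""
    m = b1 * m + (1 - b1) * g           # first-moment (mean) estimate
    v = b2 * v + (1 - b2) * g ** 2      # second-moment (uncentered variance) estimate
    m_hat = m / (1 - b1 ** t)           # bias correction for zero initialization
    v_hat = v / (1 - b2 ** t)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v

# Minimize f(w) = w^2, whose gradient is 2w.
w, m, v = np.array([1.0]), np.zeros(1), np.zeros(1)
for t in range(1, 3001):
    g = 2 * w
    w, m, v = adam_step(w, g, m, v, t)

assert abs(w[0]) < 0.1                  # w is driven close to the minimum at 0
```

Note how the effective step size is roughly `lr` regardless of the raw gradient magnitude, which is the "adaptive estimates of lower-order moments" idea in the abstract.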

Adaptive Subgradient Methods for Online Learning and Stochastic Optimization

- Computer Science, Mathematics · J. Mach. Learn. Res.
- 2011

This work describes and analyzes an apparatus for adaptively modifying the proximal function, which significantly simplifies setting a learning rate and results in regret guarantees that are provably as good as the best proximal function that can be chosen in hindsight.

Optimizing Neural Networks with Kronecker-factored Approximate Curvature

- Computer Science, Mathematics · ICML
- 2015

K-FAC is an efficient method for approximating natural gradient descent in neural networks, based on an efficiently invertible approximation of the network's Fisher information matrix that is neither diagonal nor low-rank and in some cases is completely non-sparse.

Optimization Methods for Large-Scale Machine Learning

- Computer Science, Mathematics · SIAM Rev.
- 2018

A major theme of this study is that large-scale machine learning represents a distinctive setting in which the stochastic gradient method has traditionally played a central role while conventional gradient-based nonlinear optimization techniques typically falter, leading to a discussion about the next generation of optimization methods for large-scale machine learning.

Linear Convergence of Gradient and Proximal-Gradient Methods Under the Polyak-Łojasiewicz Condition

- Computer Science, Mathematics · ECML/PKDD
- 2016

This work shows that the much-older Polyak-Łojasiewicz (PL) inequality is actually weaker than the main conditions that have been explored to show linear convergence rates without strong convexity over the last 25 years, leading to simple proofs of linear convergence of these methods.
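For context, the PL inequality referenced above is the standard condition (stated here from its textbook definition, not from this abstract) that a differentiable function $f$ with minimum value $f^*$ satisfies, for some $\mu > 0$,

$$\frac{1}{2}\,\lVert \nabla f(x) \rVert^2 \;\ge\; \mu \,\bigl(f(x) - f^*\bigr) \quad \text{for all } x,$$

under which gradient descent on an $L$-smooth $f$ with step size $1/L$ converges linearly:

$$f(x_k) - f^* \;\le\; \left(1 - \frac{\mu}{L}\right)^{k} \bigl(f(x_0) - f^*\bigr).$$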

Online Learning Rate Adaptation with Hypergradient Descent

- Computer Science, Mathematics · ICLR
- 2018

We introduce a general method for improving the convergence rate of gradient-based optimizers that is easy to implement and works well in practice. We demonstrate the effectiveness of the method in a…
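The core rule of this cited paper, adapting a scalar learning rate $\alpha$ via $\alpha_t = \alpha_{t-1} + \beta\,\nabla f(w_t)\!\cdot\!\nabla f(w_{t-1})$, can be sketched on a simple quadratic. The objective and the values of $\alpha_0$ and $\beta$ below are illustrative assumptions, not the paper's experimental settings.

```python
import numpy as np

def f_grad(w):
    return 2 * w                        # gradient of f(w) = ||w||^2

w = np.array([5.0, -3.0])
alpha, beta = 0.01, 1e-4                # initial learning rate, hyper-learning rate
g_prev = np.zeros_like(w)

for _ in range(200):
    g = f_grad(w)
    # Hypergradient update: if consecutive gradients agree (positive dot
    # product), the step size was too small, so increase it; if they
    # disagree, decrease it.
    alpha += beta * np.dot(g, g_prev)
    w -= alpha * g
    g_prev = g

assert np.linalg.norm(w) < 1e-3         # converges after alpha adapts upward
```

The update is self-stabilizing on this objective: once $\alpha$ overshoots, consecutive gradients point in opposing directions, the dot product turns negative, and $\alpha$ shrinks again.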

Gradient-based Hyperparameter Optimization through Reversible Learning

- Mathematics, Computer Science · ICML
- 2015

This work computes exact gradients of cross-validation performance with respect to all hyperparameters by chaining derivatives backwards through the entire training procedure, which allows the optimization of thousands of hyperparameters, including step-size and momentum schedules, weight initialization distributions, richly parameterized regularization schemes, and neural network architectures.

Natural Neural Networks

- Computer Science, Mathematics · NIPS
- 2015

A specific example is presented that employs a simple and efficient reparametrization of the neural network weights by implicitly whitening the representation obtained at each layer, while preserving the feed-forward computation of the network.