# Unbiased Gradient Estimation in Unrolled Computation Graphs with Persistent Evolution Strategies

```bibtex
@article{Vicol2021UnbiasedGE,
  title   = {Unbiased Gradient Estimation in Unrolled Computation Graphs with Persistent Evolution Strategies},
  author  = {Paul Vicol and Luke Metz and Jascha Sohl-Dickstein},
  journal = {ArXiv},
  year    = {2021},
  volume  = {abs/2112.13835}
}
```

Unrolled computation graphs arise in many scenarios, including training RNNs, tuning hyperparameters through unrolled optimization, and training learned optimizers. Current approaches to optimizing parameters in such computation graphs suffer from high-variance gradients, bias, slow updates, or large memory usage. We introduce a method called Persistent Evolution Strategies (PES), which divides the computation graph into a series of truncated unrolls, and performs an evolution strategies-based…
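The core idea the abstract describes — truncated unrolls combined with an evolution strategies estimator whose perturbations persist across truncation boundaries — can be sketched as follows. This is a hypothetical illustration, not the authors' code: the function names, the antithetic-sampling details, and the toy dynamical system are all assumptions.

```python
import numpy as np

def pes_gradient(theta, step_fn, loss_fn, init_state, num_steps,
                 trunc_len, num_pairs=2000, sigma=0.1, seed=0):
    """Estimate dL/dtheta for L = sum of per-step losses of an unrolled system.

    Antithetic perturbations are drawn once per truncation window; each
    particle keeps a running sum (xi) of all its past perturbations, which
    is what removes the truncation bias of plain ES on short unrolls.
    """
    rng = np.random.default_rng(seed)
    n = 2 * num_pairs
    states = np.full(n, init_state, dtype=float)
    xi = np.zeros(n)                        # persistent accumulated perturbations
    grad = 0.0
    for start in range(0, num_steps, trunc_len):
        eps = rng.normal(0.0, sigma, num_pairs)
        eps = np.concatenate([eps, -eps])   # antithetic pairs
        window_loss = np.zeros(n)
        for _ in range(start, min(start + trunc_len, num_steps)):
            states = step_fn(states, theta + eps)
            window_loss += loss_fn(states)
        xi += eps                           # perturbations persist across unrolls
        grad += np.mean(xi * window_loss) / sigma**2
    return grad

# Toy unroll: s_{t+1} = s_t + theta, per-step loss s_t^2.
# Over 4 steps, L(theta) = 30 * theta^2, so dL/dtheta = 60 * theta.
step = lambda s, th: s + th
loss = lambda s: s**2
g = pes_gradient(1.0, step, loss, init_state=0.0, num_steps=4, trunc_len=2)
print(g)  # close to 60.0, the true gradient
```

Because the accumulator `xi` carries perturbations forward, the window losses are correlated with the full history of parameter noise, which is what yields an unbiased estimate across truncations under these assumptions.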


## 5 Citations

Hyper-Learning for Gradient-Based Batch Size Adaptation

- Computer Science
- 2022

This work introduces Arbiter, a new hyperparameter optimization algorithm that performs batch size adaptation for learnable scheduling heuristics using gradients from a meta-objective function, which overcomes previous heuristic constraints by enforcing a novel learning process called hyper-learning.

Amortized Proximal Optimization

- Computer Science, ArXiv
- 2022

The idea behind APO is to amortize the minimization of the proximal point objective by meta-learning the parameters of an update rule, and it is shown how APO can be used to adapt a learning rate or a structured preconditioning matrix.

Practical tradeoffs between memory, compute, and performance in learned optimizers

- Computer Science, ArXiv
- 2022

This work identifies and quantifies the design features governing the memory, compute, and performance trade-offs for many learned and hand-designed optimizers, and constructs a learned optimizer that is both faster and more memory-efficient than previous work.

Neural Simulated Annealing

- Business, Computer Science, ArXiv
- 2022

This work views SA from a reinforcement learning perspective and frames the proposal distribution as a policy, which can be optimised for higher solution quality given a fixed computational budget, and demonstrates that Neural SA, with a learnt proposal distribution parametrised by small equivariant neural networks, outperforms SA baselines on a number of problems.

Tutorial on amortized optimization for learning to optimize over continuous domains

- Computer Science, ArXiv
- 2022

This tutorial discusses the key design choices behind amortized optimization, roughly categorizing models into fully-amortized and semi-amortized approaches, and learning methods into regression-based and objective-based approaches.

## References

Showing 1–10 of 67 references

Non-greedy Gradient-based Hyperparameter Optimization Over Long Horizons

- Computer Science, ArXiv
- 2020

This work enables non-greediness over long horizons with a two-fold solution: it derives a forward-mode differentiation algorithm for the popular momentum-based SGD optimizer, which allows for a memory cost that is constant in horizon size.

Forward and Reverse Gradient-Based Hyperparameter Optimization

- Computer Science, ICML
- 2017

We study two procedures (reverse-mode and forward-mode) for computing the gradient of the validation error with respect to the hyperparameters of any iterative learning algorithm such as stochastic…

Hyperband: A Novel Bandit-Based Approach to Hyperparameter Optimization

- Computer Science, J. Mach. Learn. Res.
- 2017

A novel algorithm, Hyperband, is introduced that frames hyperparameter optimization as a pure-exploration non-stochastic infinite-armed bandit problem, where a predefined resource such as iterations, data samples, or features is allocated to randomly sampled configurations.

Variance Reduction for Evolution Strategies via Structured Control Variates

- Computer Science, Mathematics, AISTATS
- 2020

A new method for improving the accuracy of ES algorithms that, as opposed to recent approaches utilizing only the Monte Carlo structure of the gradient estimator, takes advantage of the underlying MDP structure to reduce variance.

Efficient Optimization of Loops and Limits with Randomized Telescoping Sums

- Computer Science, ICML
- 2019

This work proposes randomized telescope (RT) gradient estimators, which represent the objective as the sum of a telescoping series and sample linear combinations of terms to provide cheap unbiased gradient estimates, and derives a method for tuning RT estimators online to maximize a lower bound on the expected decrease in loss per unit of computation.
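The telescoping-sum idea in the blurb above admits a compact sketch: write the horizon-H objective as a sum of successive improvements, randomly truncate the sum, and reweight each kept term by its inclusion probability so the estimate stays unbiased. This is a hypothetical illustration under assumed names and a toy loss sequence, not the paper's implementation.

```python
import numpy as np

def rt_estimate(losses, q, rng):
    """Unbiased single-sample estimate of losses[-1], the horizon-H value.

    losses[i] = L_{i+1}; q[i] = probability of truncating after term i+1.
    Each kept telescoping term Delta_i is divided by P(N >= i), the
    probability it is included, which removes the truncation bias.
    """
    deltas = np.diff(np.concatenate([[0.0], losses]))  # telescoping terms
    tail = np.cumsum(q[::-1])[::-1]                    # P(N >= i) for each term
    n = rng.choice(len(q), p=q) + 1                    # sampled truncation length
    return float(np.sum(deltas[:n] / tail[:n]))

losses = np.array([1.0, 1.5, 1.75, 1.875])   # toy L_1..L_4 sequence (assumed)
q = np.array([0.4, 0.3, 0.2, 0.1])           # truncation distribution (assumed)
rng = np.random.default_rng(0)
single = rt_estimate(losses, q, rng)

# The exact expectation over the truncation index recovers L_4 = 1.875,
# confirming unbiasedness without Monte Carlo noise.
deltas = np.diff(np.concatenate([[0.0], losses]))
tail = np.cumsum(q[::-1])[::-1]
expect = sum(qn * np.sum(deltas[:n] / tail[:n])
             for n, qn in zip(range(1, 5), q))
print(expect)  # 1.875 (up to float rounding)
```

Shorter truncations are cheap but high-variance per term; the tuning method the blurb mentions amounts to choosing `q` to trade variance against expected compute.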

Unbiased Online Recurrent Optimization

- Computer Science, ICLR
- 2018

The novel Unbiased Online Recurrent Optimization (UORO) algorithm allows for online learning of general recurrent computational graphs such as recurrent network models and performs well thanks to the unbiasedness of its gradients.

Gradient-based Hyperparameter Optimization through Reversible Learning

- Computer Science, ICML
- 2015

This work computes exact gradients of cross-validation performance with respect to all hyperparameters by chaining derivatives backwards through the entire training procedure, which allows optimizing thousands of hyperparameters, including step-size and momentum schedules, weight initialization distributions, richly parameterized regularization schemes, and neural network architectures.

Understanding Short-Horizon Bias in Stochastic Meta-Optimization

- Computer Science, ICLR
- 2018

Short-horizon bias is a fundamental problem that must be addressed if meta-optimization is to scale to practical neural net training regimes; a toy problem, a noisy quadratic cost function, is introduced on which the bias is analyzed.

On the Variance of Unbiased Online Recurrent Optimization

- Computer Science, ArXiv
- 2019

The variance of the gradient estimate computed by UORO is analyzed, several changes to the method that reduce this variance are proposed, and a fundamental connection is demonstrated between its gradient estimate and the one that would be computed by REINFORCE if small amounts of noise were added to the RNN's hidden units.

Truncated Back-propagation for Bilevel Optimization

- Computer Science, AISTATS
- 2019

It is found that optimization with the approximate gradient computed using few-step back-propagation often performs comparably to optimization with the exact gradient, while requiring far less memory and half the computation time.