Amortized Proximal Optimization

Juhan Bae, Paul Vicol, Jeff Z. HaoChen, Roger B. Grosse
We propose a framework for online meta-optimization of parameters that govern optimization, called Amortized Proximal Optimization (APO). We first interpret various existing neural network optimizers as approximate stochastic proximal point methods which trade off the current-batch loss with proximity terms in both function space and weight space. The idea behind APO is to amortize the minimization of the proximal point objective by meta-learning the parameters of an update rule. We show how… 
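The stochastic proximal point objective mentioned above, the current-batch loss traded off against a proximity term to the current weights, can be sketched in a few lines. This is only an illustration of the objective being amortized, not the paper's meta-learned update rule; the inner gradient loop, `lam`, and step sizes here are assumptions.

```python
import numpy as np

def proximal_point_step(grad_fn, w, lam=1.0, inner_steps=20, inner_lr=0.1):
    """Approximately solve w_next = argmin_u L(u) + (lam/2)||u - w||^2
    with a few inner gradient steps on the proximal objective."""
    u = w.copy()
    for _ in range(inner_steps):
        g = grad_fn(u) + lam * (u - w)  # gradient of loss plus proximity term
        u = u - inner_lr * g
    return u

# Toy example: L(u) = 0.5 ||u||^2, so grad_fn is the identity.
w = np.array([2.0, -1.0])
for _ in range(50):
    w = proximal_point_step(lambda x: x, w)
```

APO's contribution is to avoid solving this inner minimization explicitly at every step by meta-learning optimizer parameters that approximate its solution.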


Adaptive Subgradient Methods for Online Learning and Stochastic Optimization
This work describes and analyzes an apparatus for adaptively modifying the proximal function, which significantly simplifies setting a learning rate and results in regret guarantees that are provably as good as the best proximal function that could be chosen in hindsight.
Understanding Short-Horizon Bias in Stochastic Meta-Optimization
Short-horizon bias is a fundamental problem that must be addressed if meta-optimization is to scale to practical neural net training regimes; the work introduces a toy problem, a noisy quadratic cost function, on which the bias is analyzed.
First-Order Preconditioning via Hypergradient Descent
First-order preconditioning (FOP) is introduced, a fast, scalable approach that generalizes previous work on hypergradient descent and is able to improve the performance of standard deep learning optimizers on visual classification and reinforcement learning tasks with minimal computational overhead.
Adam: A Method for Stochastic Optimization
This work introduces Adam, an algorithm for first-order gradient-based optimization of stochastic objective functions, based on adaptive estimates of lower-order moments, and provides a regret bound on the convergence rate that is comparable to the best known results under the online convex optimization framework.
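The adaptive moment estimates behind Adam can be sketched in a few lines. The hyperparameter defaults match the paper, but the toy loop below is only an illustrative sketch, not a production implementation.

```python
import numpy as np

def adam_step(theta, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update: exponential moving averages of the first and second
    moments of the gradient, with bias correction for the zero initialization."""
    m = beta1 * m + (1 - beta1) * grad       # first moment (mean of gradients)
    v = beta2 * v + (1 - beta2) * grad**2    # second moment (uncentered variance)
    m_hat = m / (1 - beta1**t)               # bias-corrected estimates
    v_hat = v / (1 - beta2**t)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

# Minimize f(theta) = theta^2 (gradient 2*theta) starting from theta = 1.
theta, m, v = 1.0, 0.0, 0.0
for t in range(1, 2001):
    theta, m, v = adam_step(theta, 2.0 * theta, m, v, t, lr=0.01)
```

Note that the per-step update magnitude is roughly bounded by the learning rate, since the gradient is normalized by its own second-moment estimate.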
Learned optimizers that outperform on wall-clock and validation loss
This work proposes a training scheme that overcomes two key difficulties in training learned optimizers, by dynamically weighting two unbiased gradient estimators for a variational loss on optimizer performance, and is able to learn optimizers that train networks to better generalization than first-order methods.
Which Algorithmic Choices Matter at Which Batch Sizes? Insights From a Noisy Quadratic Model
It is experimentally demonstrated that optimization algorithms that employ preconditioning, specifically Adam and K-FAC, result in much larger critical batch sizes than stochastic gradient descent with momentum.
Scalable Bayesian Optimization Using Deep Neural Networks
This work shows that performing adaptive basis function regression with a neural network as the parametric form performs competitively with state-of-the-art GP-based approaches, but scales linearly rather than cubically with the number of data points, which allows for a previously intractable degree of parallelism.
Unbiased Online Recurrent Optimization
The novel Unbiased Online Recurrent Optimization (UORO) algorithm allows for online learning of general recurrent computational graphs such as recurrent network models and performs well thanks to the unbiasedness of its gradients.
Fast Approximate Natural Gradient Descent in a Kronecker-factored Eigenbasis
This work proposes a novel approximation that is provably better than KFAC and amenable to cheap partial updates, which consists of tracking a diagonal variance, not in parameter coordinates, but in a Kronecker-factored eigenbasis, in which the diagonal approximation is likely to be more effective.
Online Learning Rate Adaptation with Hypergradient Descent
We introduce a general method for improving the convergence rate of gradient-based optimizers that is easy to implement and works well in practice. We demonstrate the effectiveness of the method in a…
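The hypergradient descent idea, adapting the learning rate itself by gradient descent using the fact that the derivative of the loss with respect to the learning rate is minus the dot product of consecutive gradients, can be sketched as follows. The hyper-learning-rate `beta` and the toy quadratic are illustrative assumptions.

```python
import numpy as np

def hypergradient_sgd(grad_fn, theta, alpha=0.01, beta=1e-4, steps=100):
    """SGD whose learning rate alpha is itself adapted online:
    d(loss)/d(alpha) = -g_t . g_{t-1}, so we ascend alpha by beta * g_t . g_{t-1}."""
    g_prev = np.zeros_like(theta)
    for _ in range(steps):
        g = grad_fn(theta)
        alpha = alpha + beta * np.dot(g, g_prev)  # hypergradient update to alpha
        theta = theta - alpha * g                 # ordinary SGD step
        g_prev = g
    return theta, alpha

# Toy quadratic f(x) = 0.5 ||x||^2, whose gradient is x itself.
theta, alpha = hypergradient_sgd(lambda x: x, np.ones(3), alpha=0.05)
```

When consecutive gradients point in similar directions their dot product is positive and the learning rate grows; when the iterates overshoot and gradients reverse, it shrinks.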