Corpus ID: 220250381

On the Iteration Complexity of Hypergradient Computation

@article{Grazzi2020OnTI,
  title={On the Iteration Complexity of Hypergradient Computation},
  author={Riccardo Grazzi and Luca Franceschi and Massimiliano Pontil and Saverio Salzo},
  journal={ArXiv},
  year={2020},
  volume={abs/2006.16218}
}
We study a general class of bilevel problems, consisting in the minimization of an upper-level objective which depends on the solution to a parametric fixed-point equation. Important instances arising in machine learning include hyperparameter optimization, meta-learning, and certain graph and recurrent neural networks. Typically the gradient of the upper-level objective (hypergradient) is hard or even impossible to compute exactly, which has raised the interest in approximation methods. We… 
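For orientation, the setting sketched in the abstract is commonly written as follows (the symbols E, Φ, w and λ below are notational choices made for this summary, not necessarily the paper's):

\[
\min_{\lambda}\; f(\lambda) := E\big(w(\lambda), \lambda\big)
\qquad \text{subject to} \qquad
w(\lambda) = \Phi\big(w(\lambda), \lambda\big),
\]
\[
\nabla f(\lambda)
= \nabla_{\lambda} E\big(w(\lambda), \lambda\big)
+ \partial_{\lambda}\Phi\big(w(\lambda), \lambda\big)^{\top}
  \Big(I - \partial_{w}\Phi\big(w(\lambda), \lambda\big)\Big)^{-\top}
  \nabla_{w} E\big(w(\lambda), \lambda\big).
\]

The second identity, the hypergradient, follows by implicitly differentiating the fixed-point equation and is well defined when Φ(·, λ) is a contraction; approximation methods differ in how they estimate the fixed point w(λ) and the inverse linear system above.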

Citations

Convergence Properties of Stochastic Hypergradients

TLDR
This work provides iteration complexity bounds for the mean square error of the hypergradient approximation, under the assumption that the lower-level problem is accessible only through a stochastic mapping which is a contraction in expectation.

Amortized Implicit Differentiation for Stochastic Bilevel Optimization

TLDR
The analysis shows that algorithms based on inexact implicit differentiation, combined with a warm-start strategy to amortize the estimation of the exact gradient, match the computational complexity of oracle methods that have access to an unbiased estimate of the gradient, thus outperforming many existing results for bilevel optimization.

Implicit differentiation for fast hyperparameter selection in non-smooth convex learning

TLDR
This work shows that forward-mode differentiation of proximal gradient descent and proximal coordinate descent yields sequences of Jacobians converging toward the exact Jacobian, and provides a bound on the error made on the hypergradient when the inner optimization problem is solved approximately.

Stability and Generalization of Bilevel Programming in Hyperparameter Optimization

TLDR
This paper presents an expectation bound w.r.t. the validation set based on uniform stability for the classical cross-validation algorithm, and proves that regularization terms in both the outer and inner levels can relieve the overfitting problem in gradient-based algorithms.

A framework for bilevel optimization that enables stochastic and global variance reduction algorithms

TLDR
It is shown that SABA, an adaptation of the celebrated SAGA algorithm to this framework, has an O(1/T) convergence rate and achieves linear convergence under the Polyak-Łojasiewicz assumption; it is the first stochastic algorithm for bilevel optimization that verifies either of these properties.

Iterative Implicit Gradients for Nonconvex Optimization with Variational Inequality Constraints

TLDR
An efficient way of obtaining the implicit gradient is proposed that takes a possible large-scale structure into account, and error bounds with respect to the true gradients are provided.

Bilevel Optimization with a Lower-level Contraction: Optimal Sample Complexity without Warm-Start

TLDR
This work proposes a simple method which uses stochastic fixed-point iterations at the lower level and projected inexact gradient descent at the upper level to achieve order-wise optimal or near-optimal sample complexity.

On Implicit Bias in Overparameterized Bilevel Optimization

TLDR
This work delineates two standard BLO methods—cold-start and warm-start BLO—and shows that the converged solution or long-run behavior depends to a large degree on these and other algorithmic choices, such as the hypergradient approximation.

A Near-Optimal Algorithm for Stochastic Bilevel Optimization via Double-Momentum

TLDR
This work proposes a new algorithm – the Single-timescale Double-momentum Stochastic Approximation (SUSTAIN) – for tackling stochastic unconstrained bilevel optimization problems where the lower-level subproblem is strongly convex and the upper-level objective function is smooth.

Penalty Method for Inversion-Free Deep Bilevel Optimization

TLDR
This paper proposes a new method for solving bilevel optimization problems using the classical penalty-function approach, which avoids computing the inverse and can also handle additional constraints easily; it proves the convergence of the method under mild conditions and shows that the exact hypergradient is obtained asymptotically.
...

References

Showing 1–10 of 32 references

Truncated Back-propagation for Bilevel Optimization

TLDR
It is found that optimization with the approximate gradient computed using few-step back-propagation often performs comparably to optimization with the exact gradient, while requiring far less memory and half the computation time.
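As an illustration of the idea summarized above, here is a minimal NumPy sketch on a hypothetical ridge-regression instance (the data, step size and variable names are assumptions made for this example, not taken from the paper): run T inner gradient-descent steps, then back-propagate through only the last K of them to obtain an approximate hypergradient of the validation loss with respect to the regularization parameter.

import numpy as np

rng = np.random.default_rng(1)

# Hypothetical toy instance (not from the paper): the inner problem is ridge regression,
#   g(w, lam) = 0.5*||Xtr w - ytr||^2 + 0.5*lam*||w||^2,
# solved by T gradient-descent steps; the outer objective is the validation loss.
Xtr, ytr = rng.normal(size=(50, 10)), rng.normal(size=50)
Xval, yval = rng.normal(size=(30, 10)), rng.normal(size=30)
lam, T, K = 0.5, 500, 20                    # K = number of back-propagated steps

H = Xtr.T @ Xtr + lam * np.eye(10)          # inner Hessian (constant for a quadratic)
alpha = 1.0 / np.linalg.eigvalsh(H).max()   # step size small enough for convergence

ws = [np.zeros(10)]                         # iterates of the inner gradient descent
for _ in range(T):
    w = ws[-1]
    ws.append(w - alpha * (Xtr.T @ (Xtr @ w - ytr) + lam * w))

# Reverse-mode (back-propagation) through the last K inner steps only.
# For w_{t+1} = w_t - alpha * grad_w g(w_t, lam):
#   d w_{t+1} / d w_t  = I - alpha * H
#   d w_{t+1} / d lam  = -alpha * w_t
p = Xval.T @ (Xval @ ws[-1] - yval)         # gradient of the outer loss at the last iterate
hypergrad = 0.0
for t in range(T - 1, T - 1 - K, -1):
    hypergrad += -alpha * (ws[t] @ p)       # contribution of step t via d w_{t+1} / d lam
    p = p - alpha * (H @ p)                 # p <- (I - alpha*H)^T p

print(f"truncated ({K}-step) hypergradient: {hypergrad:.6f}")

Increasing K recovers full reverse-mode differentiation of the unrolled inner loop, at the cost of storing and back-propagating through more iterates.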

Penalty Method for Inversion-Free Deep Bilevel Optimization

TLDR
This paper proposes a new method for solving bilevel optimization problems using the classical penalty-function approach, which avoids computing the inverse and can also handle additional constraints easily; it proves the convergence of the method under mild conditions and shows that the exact hypergradient is obtained asymptotically.

Meta-learning with differentiable closed-form solvers

TLDR
The main idea is to teach a deep network to use standard machine learning tools, such as ridge regression, as part of its own internal model, enabling it to quickly adapt to novel data.

OptNet: Differentiable Optimization as a Layer in Neural Networks

TLDR
OptNet is presented, a network architecture that integrates optimization problems (here, specifically in the form of quadratic programs) as individual layers in larger end-to-end trainable deep networks, and shows how techniques from sensitivity analysis, bilevel optimization, and implicit differentiation can be used to exactly differentiate through these layers.

Forward and Reverse Gradient-Based Hyperparameter Optimization

We study two procedures (reverse-mode and forward-mode) for computing the gradient of the validation error with respect to the hyperparameters of any iterative learning algorithm such as stochastic gradient descent…

Learning-to-Learn Stochastic Gradient Descent with Biased Regularization

TLDR
A key feature of the results is that, when the number of tasks grows and their variance is relatively small, the learning-to-learn approach has a significant advantage over learning each task in isolation by Stochastic Gradient Descent without a bias term.

Practical Bayesian Optimization of Machine Learning Algorithms

TLDR
This work describes new algorithms that take into account the variable cost of learning-algorithm experiments and that can leverage multiple cores for parallel experimentation; it shows that the proposed algorithms improve on previous automatic procedures and can reach or surpass human expert-level optimization for many algorithms.

Automatic differentiation in machine learning: a survey

TLDR
By precisely defining the main differentiation techniques and their interrelationships, this work aims to bring clarity to the usage of the terms “autodiff”, “automatic differentiation”, and “symbolic differentiation” as these are encountered more and more in machine learning settings.

Optimizing Millions of Hyperparameters by Implicit Differentiation

TLDR
An algorithm for inexpensive gradient-based hyperparameter optimization that combines the implicit function theorem (IFT) with efficient inverse-Hessian approximations is proposed and used to train modern network architectures with millions of weights and millions of hyperparameters.
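To make the IFT-plus-approximate-inverse idea concrete, here is a minimal NumPy sketch on a hypothetical ridge-regression problem; the truncated Neumann series below is one standard way to approximate the required inverse-Hessian-vector product and stands in for the efficient approximations mentioned in the summary (the toy data and names are assumptions made for this example).

import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy bilevel problem (not from the paper):
#   lower level : w(lam) = argmin_w 0.5*||Xtr w - ytr||^2 + 0.5*lam*||w||^2
#   upper level : f(lam) = 0.5*||Xval w(lam) - yval||^2
Xtr, ytr = rng.normal(size=(50, 10)), rng.normal(size=50)
Xval, yval = rng.normal(size=(30, 10)), rng.normal(size=30)
lam = 0.5

H = Xtr.T @ Xtr + lam * np.eye(10)          # lower-level Hessian
w = np.linalg.solve(H, Xtr.T @ ytr)         # exact lower-level solution
grad_E = Xval.T @ (Xval @ w - yval)         # gradient of the validation loss at w(lam)

# Exact hypergradient via the implicit function theorem:
#   df/dlam = -(d_lam grad_w g)^T H^{-1} grad_E = -w^T H^{-1} grad_E
exact = -w @ np.linalg.solve(H, grad_E)

# Approximate H^{-1} grad_E with a truncated Neumann series,
#   H^{-1} v = eta * sum_k (I - eta*H)^k v,   valid for 0 < eta < 2 / lambda_max(H).
eta = 1.0 / np.linalg.eigvalsh(H).max()
p, s = grad_E.copy(), grad_E.copy()
for _ in range(200):                        # add 200 further Neumann terms
    p -= eta * (H @ p)                      # p <- (I - eta*H) p
    s += p
approx = -w @ (eta * s)

print(f"exact hypergradient  : {exact:.6f}")
print(f"Neumann approximation: {approx:.6f}")

In larger models the Hessian is never formed explicitly; the products H @ p are replaced by Hessian-vector products, which keeps the approximation tractable even with millions of weights.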

Differentiable Convex Optimization Layers

TLDR
This paper introduces disciplined parametrized programming, a subset of disciplined convex programming, and demonstrates how to efficiently differentiate through each of these components, allowing for end-to-end analytical differentiation through the entire convex program.