# On the Iteration Complexity of Hypergradient Computation

@article{Grazzi2020OnTI, title={On the Iteration Complexity of Hypergradient Computation}, author={Riccardo Grazzi and Luca Franceschi and Massimiliano Pontil and Saverio Salzo}, journal={ArXiv}, year={2020}, volume={abs/2006.16218} }

We study a general class of bilevel problems, consisting in the minimization of an upper-level objective which depends on the solution to a parametric fixed-point equation. Important instances arising in machine learning include hyperparameter optimization, meta-learning, and certain graph and recurrent neural networks. Typically the gradient of the upper-level objective (hypergradient) is hard or even impossible to compute exactly, which has raised the interest in approximation methods. We…
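As a toy illustration of the quantity being approximated (a hypothetical ridge-regression instance, not an example from the paper), the implicit function theorem gives the exact hypergradient of a validation loss with respect to a regularization parameter, since the lower-level solution satisfies a parametric fixed-point (here, stationarity) equation:

```python
import numpy as np

# Toy bilevel problem (illustrative, not the paper's experiments):
# lower level : w*(lam) = argmin_w 0.5*||X_tr w - y_tr||^2 + 0.5*lam*||w||^2
# upper level : E(lam)  = 0.5*||X_val w*(lam) - y_val||^2
rng = np.random.default_rng(0)
X_tr, y_tr = rng.standard_normal((20, 5)), rng.standard_normal(20)
X_val, y_val = rng.standard_normal((10, 5)), rng.standard_normal(10)

def inner_solution(lam):
    # Ridge regression has a closed form; in general one would iterate.
    A = X_tr.T @ X_tr + lam * np.eye(5)
    return np.linalg.solve(A, X_tr.T @ y_tr)

def hypergradient(lam):
    # Implicit function theorem: (X'X + lam I) dw*/dlam + w* = 0,
    # so dw*/dlam = -(X'X + lam I)^{-1} w* and dE/dlam = grad_w E(w*)' dw*/dlam.
    w = inner_solution(lam)
    grad_w_E = X_val.T @ (X_val @ w - y_val)
    A = X_tr.T @ X_tr + lam * np.eye(5)
    dw_dlam = -np.linalg.solve(A, w)
    return grad_w_E @ dw_dlam

def E(lam):
    w = inner_solution(lam)
    return 0.5 * np.sum((X_val @ w - y_val) ** 2)

# sanity check against central finite differences
lam = 0.7
fd = (E(lam + 1e-6) - E(lam - 1e-6)) / 2e-6
assert abs(hypergradient(lam) - fd) < 1e-4
```

In practice the linear solve is replaced by an iterative approximation, which is where the iteration-complexity analysis enters.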

## 61 Citations

### Convergence Properties of Stochastic Hypergradients

- Computer Science, Mathematics
- AISTATS
- 2021

This work provides iteration complexity bounds for the mean square error of the hypergradient approximation, under the assumption that the lower-level problem is accessible only through a stochastic mapping which is a contraction in expectation.

### Amortized Implicit Differentiation for Stochastic Bilevel Optimization

- Computer Science
- ICLR
- 2022

This analysis shows algorithms based on inexact implicit differentiation and a warm-start strategy to amortize the estimation of the exact gradient to match the computational complexity of oracle methods that have access to an unbiased estimate of the gradient, thus outperforming many existing results for bilevel optimization.

### Implicit differentiation for fast hyperparameter selection in non-smooth convex learning

- Computer Science, Mathematics
- ArXiv
- 2021

This work shows that forward-mode differentiation of proximal gradient descent and proximal coordinate descent yields sequences of Jacobians converging toward the exact Jacobian, and provides a bound on the error made on the hypergradient when the inner optimization problem is solved approximately.
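The forward-mode scheme can be sketched on a one-dimensional lasso-type problem (a hypothetical toy instance; the cited work treats the general non-smooth convex case): each proximal gradient step is differentiated with respect to the regularization parameter alongside the iterate itself.

```python
# Forward-mode differentiation of proximal gradient descent on the scalar
# problem argmin_w 0.5*(w - a)^2 + lam*|w| (illustrative toy instance).
def lasso_forward_mode(a, lam, eta=0.5, T=100):
    w, dw = 0.0, 0.0                    # iterate and its derivative w.r.t. lam
    for _ in range(T):
        z  = w - eta * (w - a)          # gradient step on the smooth part
        dz = (1.0 - eta) * dw           # forward-mode derivative of z
        if abs(z) > eta * lam:          # soft-thresholding = prox of eta*lam*|.|
            s  = 1.0 if z > 0 else -1.0
            w  = z - eta * lam * s
            dw = dz - eta * s           # differentiate the prox w.r.t. lam
        else:
            w, dw = 0.0, 0.0
    return w, dw

# The exact solution is soft-threshold(a, lam), whose lam-derivative is
# -sign(a) when |a| > lam, so the propagated Jacobian should converge to it.
w, dw = lasso_forward_mode(2.0, 0.5)
assert abs(w - 1.5) < 1e-8 and abs(dw + 1.0) < 1e-8
```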

### Stability and Generalization of Bilevel Programming in Hyperparameter Optimization

- Computer Science
- NeurIPS
- 2021

This paper presents an expectation bound w.r.t. the validation set based on uniform stability for the classical cross-validation algorithm, and proves that regularization terms in both the outer and inner levels can relieve the overfitting problem in gradient-based algorithms.

### A framework for bilevel optimization that enables stochastic and global variance reduction algorithms

- Computer Science, Mathematics
- ArXiv
- 2022

SABA, an adaptation of the celebrated SAGA algorithm in this framework, has an O(1/T) convergence rate and achieves linear convergence under the Polyak-Łojasiewicz assumption; it is the first stochastic algorithm for bilevel optimization that verifies either of these properties.

### Iterative Implicit Gradients for Nonconvex Optimization with Variational Inequality Constraints

- Computer Science
- 2022

An efficient way of obtaining the implicit gradient is proposed, taking into account a possible large-scale structure, and error bounds with respect to the true gradients are provided.

### Bilevel Optimization with a Lower-level Contraction: Optimal Sample Complexity without Warm-Start

- Computer Science
- ArXiv
- 2022

This work proposes a simple method which uses stochastic fixed-point iterations at the lower level and projected inexact gradient descent at the upper level to achieve order-wise optimal or near-optimal sample complexity.

### On Implicit Bias in Overparameterized Bilevel Optimization

- Computer Science
- ICML
- 2022

This work delineates two standard BLO methods—cold-start and warm-start BLO—and shows that the converged solution or long-run behavior depends to a large degree on these and other algorithmic choices, such as the hypergradient approximation.

### A Near-Optimal Algorithm for Stochastic Bilevel Optimization via Double-Momentum

- Computer Science
- 2021

This work proposes a new algorithm, the Single-timescale Double-momentum Stochastic Approximation (SUSTAIN), for tackling stochastic unconstrained bilevel optimization problems where the lower-level subproblem is strongly convex and the upper-level objective function is smooth.

### Penalty Method for Inversion-Free Deep Bilevel Optimization

- Computer Science
- ArXiv
- 2019

This paper proposes a new method for solving bilevel optimization problems based on the classical penalty function approach, which avoids computing the inverse and can easily handle additional constraints. It proves convergence of the method under mild conditions and shows that the exact hypergradient is obtained asymptotically.

## References

Showing 1–10 of 32 references

### Truncated Back-propagation for Bilevel Optimization

- Computer Science
- AISTATS
- 2019

It is found that optimization with the approximate gradient computed using few-step back-propagation often performs comparably to optimization with the exact gradient, while requiring far less memory and half the computation time.

### Penalty Method for Inversion-Free Deep Bilevel Optimization

- Computer Science
- ArXiv
- 2019

This paper proposes a new method for solving bilevel optimization problems based on the classical penalty function approach, which avoids computing the inverse and can easily handle additional constraints. It proves convergence of the method under mild conditions and shows that the exact hypergradient is obtained asymptotically.

### Meta-learning with differentiable closed-form solvers

- Computer Science
- ICLR
- 2019

The main idea is to teach a deep network to use standard machine learning tools, such as ridge regression, as part of its own internal model, enabling it to quickly adapt to novel data.

### OptNet: Differentiable Optimization as a Layer in Neural Networks

- Computer Science
- ICML
- 2017

OptNet is presented, a network architecture that integrates optimization problems (here, specifically in the form of quadratic programs) as individual layers in larger end-to-end trainable deep networks, and shows how techniques from sensitivity analysis, bilevel optimization, and implicit differentiation can be used to exactly differentiate through these layers.
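The sensitivity-analysis idea can be sketched on the simplest possible case (an unconstrained quadratic; names illustrative, and OptNet itself handles general constrained QPs): the layer's forward pass solves the QP, and the implicit function theorem yields the backward pass without unrolling the solver.

```python
import numpy as np

# "QP as a layer", minimal unconstrained sketch:
# forward:  w(c) = argmin_w 0.5*w'Qw - c'w  =  Q^{-1} c
# backward: dL/dc = Q^{-1} dL/dw  (Q symmetric, so Q^{-T} = Q^{-1})
def qp_layer_forward(Q, c):
    return np.linalg.solve(Q, c)

def qp_layer_backward(Q, grad_w):
    return np.linalg.solve(Q, grad_w)

Q = np.array([[3.0, 1.0], [1.0, 2.0]])
c = np.array([1.0, -1.0])
w = qp_layer_forward(Q, c)

# check the backward pass against finite differences of L(w) = sum(w)
g = qp_layer_backward(Q, np.ones(2))
eps = 1e-6
for i in range(2):
    dc = np.zeros(2); dc[i] = eps
    fd = (qp_layer_forward(Q, c + dc).sum()
          - qp_layer_forward(Q, c - dc).sum()) / (2 * eps)
    assert abs(g[i] - fd) < 1e-5
```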

### Forward and Reverse Gradient-Based Hyperparameter Optimization

- Computer Science
- ICML
- 2017

We study two procedures (reverse-mode and forward-mode) for computing the gradient of the validation error with respect to the hyperparameters of any iterative learning algorithm such as stochastic…
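The reverse-mode procedure can be sketched on a scalar toy problem (a hypothetical instance, not from the paper): the inner iterates are recorded on the forward pass, then the hypergradient is accumulated by backpropagating through the recorded trajectory.

```python
# Reverse-mode hypergradient by unrolling (scalar toy setup):
# inner: T steps of gradient descent on g(w, lam) = 0.5*(w - lam)^2
# outer: E(w_T) = (w_T - 3)^2
def unrolled_hypergradient(lam, T=50, eta=0.1):
    # forward pass: record the inner trajectory
    ws = [0.0]
    for _ in range(T):
        ws.append(ws[-1] - eta * (ws[-1] - lam))  # w <- w - eta * dg/dw
    # reverse pass: w_{t+1} = (1 - eta)*w_t + eta*lam, so each step
    # contributes eta directly to dE/dlam and scales dE/dw by (1 - eta)
    dE_dw = 2.0 * (ws[-1] - 3.0)
    dE_dlam = 0.0
    for _ in range(T):
        dE_dlam += dE_dw * eta
        dE_dw *= (1.0 - eta)
    return dE_dlam

def E(lam, T=50, eta=0.1):
    w = 0.0
    for _ in range(T):
        w -= eta * (w - lam)
    return (w - 3.0) ** 2

# the unrolled gradient is exact for the truncated objective
fd = (E(1.0 + 1e-6) - E(1.0 - 1e-6)) / 2e-6
assert abs(unrolled_hypergradient(1.0) - fd) < 1e-6
```

Forward-mode propagates dw/dlam alongside the iterate instead, trading the stored trajectory for per-step Jacobian products.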

### Learning-to-Learn Stochastic Gradient Descent with Biased Regularization

- Computer Science
- ICML
- 2019

A key feature of the results is that, when the number of tasks grows and their variance is relatively small, the learning-to-learn approach has a significant advantage over learning each task in isolation by Stochastic Gradient Descent without a bias term.

### Practical Bayesian Optimization of Machine Learning Algorithms

- Computer Science
- NIPS
- 2012

This work describes new algorithms that take into account the variable cost of learning algorithm experiments and that can leverage the presence of multiple cores for parallel experimentation and shows that these proposed algorithms improve on previous automatic procedures and can reach or surpass human expert-level optimization for many algorithms.

### Automatic differentiation in machine learning: a survey

- Computer Science
- J. Mach. Learn. Res.
- 2017

By precisely defining the main differentiation techniques and their interrelationships, this work aims to bring clarity to the usage of the terms “autodiff”, “automatic differentiation”, and “symbolic differentiation” as these are encountered more and more in machine learning settings.

### Optimizing Millions of Hyperparameters by Implicit Differentiation

- Computer Science
- AISTATS
- 2020

An algorithm for inexpensive gradient-based hyperparameter optimization that combines the implicit function theorem (IFT) with efficient inverse-Hessian approximations is proposed and used to train modern network architectures with millions of weights and millions of hyperparameters.
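One common inverse-Hessian approximation of this kind is a truncated Neumann series; a minimal numpy sketch (toy matrices, illustrative names, not the paper's implementation):

```python
import numpy as np

# Neumann-series approximation of an inverse-Hessian-vector product:
# H^{-1} v  ≈  alpha * sum_{k=0}^{K} (I - alpha*H)^k v,
# valid when the eigenvalues of alpha*H lie in (0, 2).
def neumann_inverse_hvp(H, v, alpha, K=200):
    p = v.copy()          # current term (I - alpha*H)^k v
    acc = v.copy()        # running sum, starting with the k = 0 term
    for _ in range(K):
        p = p - alpha * (H @ p)   # only Hessian-vector products needed
        acc += p
    return alpha * acc

rng = np.random.default_rng(1)
M = rng.standard_normal((4, 4))
H = M @ M.T + 4 * np.eye(4)              # symmetric positive definite
v = rng.standard_normal(4)
approx = neumann_inverse_hvp(H, v, alpha=1.0 / np.linalg.norm(H, 2))
assert np.allclose(approx, np.linalg.solve(H, v), atol=1e-3)
```

The appeal at scale is that the loop touches H only through Hessian-vector products, which autodiff frameworks provide without materializing H.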

### Differentiable Convex Optimization Layers

- Computer Science
- NeurIPS
- 2019

This paper introduces disciplined parametrized programming, a subset of disciplined convex programming, and demonstrates how to efficiently differentiate through each of these components, allowing for end-to-end analytical differentiation through the entire convex program.