# On the Iteration Complexity of Hypergradient Computation

@article{Grazzi2020OnTI, title={On the Iteration Complexity of Hypergradient Computation}, author={Riccardo Grazzi and Luca Franceschi and Massimiliano Pontil and Saverio Salzo}, journal={ArXiv}, year={2020}, volume={abs/2006.16218} }

We study a general class of bilevel problems, consisting in the minimization of an upper-level objective which depends on the solution to a parametric fixed-point equation. Important instances arising in machine learning include hyperparameter optimization, meta-learning, and certain graph and recurrent neural networks. Typically the gradient of the upper-level objective (hypergradient) is hard or even impossible to compute exactly, which has raised the interest in approximation methods. We…
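To make the setting concrete, here is a minimal sketch of one common way to approximate a hypergradient when the lower-level solution is the fixed point of a contraction. Everything in it is an illustrative assumption rather than the authors' implementation: the scalar map `phi`, the loss `upper_loss`, and the iteration counts `t` and `k` are hypothetical toy choices. The sketch first runs `t` fixed-point iterations to approximate the lower-level solution `w`, then runs `k` iterations on the linear system that appears in the implicit-differentiation formula ∇f(λ) = ∇_λE + ∂_λΦᵀ (I − ∂_wΦᵀ)⁻¹ ∇_wE.

```python
import math

def phi(w, lam):
    # Hypothetical lower-level map; a contraction in w with constant 0.5.
    return 0.5 * math.tanh(w) + lam

def dphi_dw(w, lam):
    # Partial derivative of phi with respect to w (always in (0, 0.5]).
    return 0.5 * (1.0 - math.tanh(w) ** 2)

def dphi_dlam(w, lam):
    # Partial derivative of phi with respect to lam.
    return 1.0

def upper_loss(w, lam):
    # Hypothetical upper-level objective E(w, lam); it does not depend on lam directly.
    return 0.5 * (w - 1.0) ** 2

def dloss_dw(w, lam):
    return w - 1.0

def solve_fixed_point(lam, steps):
    # Approximate w(lam) = phi(w(lam), lam) by plain fixed-point iteration.
    w = 0.0
    for _ in range(steps):
        w = phi(w, lam)
    return w

def approx_hypergradient(lam, t=50, k=50):
    # Step 1: t iterations for the lower-level solution.
    w = solve_fixed_point(lam, t)
    # Step 2: k iterations for v = dphi_dw * v + dE/dw, i.e. v = (1 - dphi_dw)^{-1} dE/dw.
    a = dphi_dw(w, lam)
    g = dloss_dw(w, lam)
    v = 0.0
    for _ in range(k):
        v = a * v + g
    # Step 3: combine; the direct term dE/dlam is zero for this toy loss.
    return dphi_dlam(w, lam) * v

def finite_difference(lam, h=1e-5):
    # Reference value obtained by differencing the (nearly exact) upper-level objective.
    f = lambda x: upper_loss(solve_fixed_point(x, 200), x)
    return (f(lam + h) - f(lam - h)) / (2.0 * h)
```

Because both loops iterate a contraction, the approximation error in this toy example shrinks geometrically as `t` and `k` grow, which can be checked by comparing `approx_hypergradient(0.3, t, k)` against `finite_difference(0.3)` for a few values of `t` and `k`.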

## 74 Citations

### On Stability and Generalization of Bilevel Optimization Problem

- Computer Science
- ArXiv
- 2022

A fundamental connection between algorithmic stability and generalization error in different forms is established, and a high-probability generalization bound is given which improves the previous best one from O(√n) to O(log n), where n is the sample size.

### Convergence Properties of Stochastic Hypergradients

- Computer Science, Mathematics
- AISTATS
- 2021

This work provides iteration complexity bounds for the mean square error of the hypergradient approximation, under the assumption that the lower-level problem is accessible only through a stochastic mapping which is a contraction in expectation.

### Bilevel Optimization: Convergence Analysis and Enhanced Design

- Computer Science
- ICML
- 2021

This paper provides a comprehensive convergence rate analysis for two popular algorithms respectively based on approximate implicit differentiation (AID) and iterative differentiation (ITD) and provides a quantitative comparison between ITD and AID based approaches.

### Amortized Implicit Differentiation for Stochastic Bilevel Optimization

- Computer Science
- ICLR
- 2022

This analysis shows algorithms based on inexact implicit differentiation and a warm-start strategy to amortize the estimation of the exact gradient to match the computational complexity of oracle methods that have access to an unbiased estimate of the gradient, thus outperforming many existing results for bilevel optimization.

### Implicit differentiation for fast hyperparameter selection in non-smooth convex learning

- Computer Science, Mathematics
- ArXiv
- 2021

This work shows that the forward-mode differentiation of proximal gradient descent and proximal coordinate descent yields sequences of Jacobians converging toward the exact Jacobian, and provides a bound on the error made on the hypergradient when the inner optimization problem is solved approximately.

### Stability and Generalization of Bilevel Programming in Hyperparameter Optimization

- Computer Science
- NeurIPS
- 2021

This paper presents an expectation bound w.r.t. the validation set based on uniform stability for the classical cross-validation algorithm, and proves that regularization terms in both the outer and inner levels can relieve the overfitting problem in gradient-based algorithms.

### A framework for bilevel optimization that enables stochastic and global variance reduction algorithms

- Computer Science, Mathematics
- ArXiv
- 2022

SABA, an adaptation of the celebrated SAGA algorithm in this framework, has an O(1/T) convergence rate and achieves linear convergence under the Polyak-Łojasiewicz assumption, making it the first stochastic algorithm for bilevel optimization that verifies either of these properties.

### Iterative Implicit Gradients for Nonconvex Optimization with Variational Inequality Constraints

- Computer Science
- 2022

An efficient way of obtaining the implicit gradient is proposed, taking into account a possible large-scale structure, and error bounds with respect to the true gradients are provided.

### Bilevel Optimization with a Lower-level Contraction: Optimal Sample Complexity without Warm-Start

- Computer Science
- ArXiv
- 2022

This work proposes a simple method which uses stochastic fixed-point iterations at the lower level and projected inexact gradient descent at the upper level to achieve order-wise optimal or near-optimal sample complexity.

### On Implicit Bias in Overparameterized Bilevel Optimization

- Computer Science
- ICML
- 2022

This work delineates two standard BLO methods—cold-start and warm-start BLO—and shows that the converged solution or long-run behavior depends to a large degree on these and other algorithmic choices, such as the hypergradient approximation.

## References

### Truncated Back-propagation for Bilevel Optimization

- Computer Science
- AISTATS
- 2019

It is found that optimization with the approximate gradient computed using few-step back-propagation often performs comparably to optimization with the exact gradient, while requiring far less memory and half the computation time.

### Hyperparameter optimization with approximate gradient

- Computer Science
- ICML
- 2016

This work proposes an algorithm for the optimization of continuous hyperparameters using inexact gradient information and gives sufficient conditions for the global convergence of this method, based on regularity conditions of the involved functions and summability of errors.

### Bilevel Programming for Hyperparameter Optimization and Meta-Learning

- Computer Science
- ICML
- 2018

We introduce a framework based on bilevel programming that unifies gradient-based hyperparameter optimization and meta-learning. We show that an approximate version of the bilevel problem can be…

### Penalty Method for Inversion-Free Deep Bilevel Optimization

- Computer Science
- ArXiv
- 2019

This paper proposes a new method for solving bilevel optimization problems using the classical penalty function approach, which avoids computing the inverse and can also handle additional constraints easily. It proves the convergence of the method under mild conditions and shows that the exact hypergradient is obtained asymptotically.

### Meta-learning with differentiable closed-form solvers

- Computer Science
- ICLR
- 2019

The main idea is to teach a deep network to use standard machine learning tools, such as ridge regression, as part of its own internal model, enabling it to quickly adapt to novel data.

### Forward and Reverse Gradient-Based Hyperparameter Optimization

- Computer Science
- ICML
- 2017

We study two procedures (reverse-mode and forward-mode) for computing the gradient of the validation error with respect to the hyperparameters of any iterative learning algorithm such as stochastic…

### Learning-to-Learn Stochastic Gradient Descent with Biased Regularization

- Computer Science
- ICML
- 2019

A key feature of the results is that, when the number of tasks grows and their variance is relatively small, the learning-to-learn approach has a significant advantage over learning each task in isolation by Stochastic Gradient Descent without a bias term.

### Practical Bayesian Optimization of Machine Learning Algorithms

- Computer Science
- NIPS
- 2012

This work describes new algorithms that take into account the variable cost of learning algorithm experiments and that can leverage the presence of multiple cores for parallel experimentation. It shows that these proposed algorithms improve on previous automatic procedures and can reach or surpass human expert-level optimization for many algorithms.

### Automatic differentiation in machine learning: a survey

- Computer Science
- J. Mach. Learn. Res.
- 2017

By precisely defining the main differentiation techniques and their interrelationships, this work aims to bring clarity to the usage of the terms “autodiff”, “automatic differentiation”, and “symbolic differentiation” as these are encountered more and more in machine learning settings.

### Optimizing Millions of Hyperparameters by Implicit Differentiation

- Computer Science
- AISTATS
- 2020

An algorithm for inexpensive gradient-based hyperparameter optimization that combines the implicit function theorem (IFT) with efficient inverse Hessian approximations is proposed and used to train modern network architectures with millions of weights and millions of hyper-parameters.