# Implicit differentiation for fast hyperparameter selection in non-smooth convex learning

@article{Bertrand2021ImplicitDF, title={Implicit differentiation for fast hyperparameter selection in non-smooth convex learning}, author={Quentin Bertrand and Quentin Klopfenstein and Mathurin Massias and Mathieu Blondel and Samuel Vaiter and Alexandre Gramfort and Joseph Salmon}, journal={ArXiv}, year={2021}, volume={abs/2105.01637} }

Finding the optimal hyperparameters of a model can be cast as a bilevel optimization problem, typically solved using zero-order techniques. In this work we study first-order methods when the inner optimization problem is convex but non-smooth. We show that the forward-mode differentiation of proximal gradient descent and proximal coordinate descent yields sequences of Jacobians converging toward the exact Jacobian. Using implicit differentiation, we show it is possible to leverage the non…
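The forward-mode scheme the abstract describes can be sketched on the Lasso: each proximal gradient (ISTA) step is differentiated with respect to the regularization parameter, so a Jacobian iterate converges alongside the primal iterate. This is a minimal illustration under my own choices of step size, iteration count, and helper names, not the authors' implementation.

```python
import numpy as np

def soft_thresh(z, t):
    """Soft-thresholding operator, the prox of t * ||.||_1."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def ista_with_jacobian(X, y, lam, n_iter=3000):
    """ISTA for the Lasso, jointly iterating the forward-mode
    Jacobian d(beta)/d(lam) of the iterate w.r.t. the regularization
    parameter, by applying the chain rule to each prox step."""
    n, p = X.shape
    gamma = 1.0 / np.linalg.norm(X, ord=2) ** 2  # step size 1/L
    beta = np.zeros(p)
    jac = np.zeros(p)  # running d(beta)/d(lam)
    for _ in range(n_iter):
        z = beta - gamma * X.T @ (X @ beta - y)
        support = np.abs(z) > gamma * lam  # where the prox is locally linear
        # chain rule through beta_{k+1} = soft_thresh(z, gamma * lam)
        jac = support * (jac - gamma * (X.T @ (X @ jac))) \
            - gamma * np.sign(z) * support
        beta = soft_thresh(z, gamma * lam)
    return beta, jac
```

The Lasso solution is piecewise linear in the regularization parameter, so once the support stabilizes the Jacobian iterates converge linearly to the exact Jacobian; that convergence of Jacobian sequences is the kind of result the abstract refers to.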


## 7 Citations

Value Function Based Difference-of-Convex Algorithm for Bilevel Hyperparameter Selection Problems

- Computer Science, ICML
- 2022

This work develops a sequentially convergent Value Function based Difference-of-Convex Algorithm with inexactness (VF-iDCA) and shows that this algorithm achieves stationary solutions without LLSC and LLS assumptions for bilevel programs from a broad class of hyperparameter tuning applications.

Nonsmooth Implicit Differentiation for Machine Learning and Optimization

- Computer Science, Mathematics, NeurIPS
- 2021

A nonsmooth implicit function theorem with an operational calculus is established, and several applications are provided, such as training deep equilibrium networks, training neural nets with conic optimization layers, or hyperparameter tuning for nonsmooth Lasso-type models.

Bilevel Optimization with a Lower-level Contraction: Optimal Sample Complexity without Warm-Start

- Computer Science, ArXiv
- 2022

A simple method is proposed which uses stochastic fixed-point iterations at the lower level and projected inexact gradient descent at the upper level, reaching an ε-stationary point using O(ε⁻²) and Õ(ε⁻¹) samples in the stochastic and deterministic settings, respectively.

Efficient and Modular Implicit Differentiation

- Computer Science, ArXiv
- 2021

This paper proposes automatic implicit differentiation, an efficient and modular approach for implicit differentiation of optimization problems, and shows the ease of formulating and solving bilevel optimization problems using the framework.

Electromagnetic neural source imaging under sparsity constraints with SURE-based hyperparameter tuning

- Computer Science
- 2021

This paper proposes to use a proxy of Stein's Unbiased Risk Estimator (SURE) to automatically select the regularization parameters, and shows that the proposed SURE approach outperforms cross-validation strategies and state-of-the-art Bayesian methods both computationally and statistically.

PUDLE: Implicit Acceleration of Dictionary Learning by Backpropagation

- Computer Science, ArXiv
- 2021

A theoretical proof for these empirical results is offered through PUDLE, a Provable Unfolded Dictionary LEarning method, providing sufficient conditions on the network initialization and data distribution for model recovery, and highlighting the interpretability of PUDLE by deriving a mathematical relation between network weights, its output, and the training data.

Stable and Interpretable Unrolled Dictionary Learning

- Computer Science
- 2021

PUDLE’s interpretability, a driving factor in designing deep networks based on iterative optimization, is demonstrated by building a mathematical relation between network weights, its output, and the training set.

## References

Showing 1–10 of 127 references

Convergence Properties of Stochastic Hypergradients

- Computer Science, Mathematics, AISTATS
- 2021

This work provides iteration complexity bounds for the mean square error of the hypergradient approximation, under the assumption that the lower-level problem is accessible only through a stochastic mapping which is a contraction in expectation.

Hyperparameter optimization with approximate gradient

- Computer Science, ICML
- 2016

This work proposes an algorithm for the optimization of continuous hyperparameters using inexact gradient information and gives sufficient conditions for the global convergence of this method, based on regularity conditions of the involved functions and summability of errors.

Optimizing Millions of Hyperparameters by Implicit Differentiation

- Computer Science, AISTATS
- 2020

An algorithm for inexpensive gradient-based hyperparameter optimization is proposed that combines the implicit function theorem (IFT) with efficient inverse-Hessian approximations, and it is used to train modern network architectures with millions of weights and millions of hyperparameters.
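The IFT-based hypergradient described in this entry can be sketched on ridge regression, where the inner problem is quadratic and the inverse-Hessian-vector product is approximated with a truncated Neumann series, in the spirit of that line of work. The toy objective, function names, and number of Neumann terms below are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def neumann_inv_hvp(H, v, alpha, n_terms=2000):
    """Approximate H^{-1} v by the truncated Neumann series
    alpha * sum_{j=0}^{n_terms} (I - alpha*H)^j v,
    which converges when ||I - alpha*H|| < 1."""
    term = v.copy()
    total = v.copy()
    for _ in range(n_terms):
        term = term - alpha * (H @ term)
        total = total + term
    return alpha * total

def ridge_hypergradient(X_tr, y_tr, X_val, y_val, lam, n_terms=2000):
    """IFT hypergradient dL_val/d(lam) for ridge regression:
    dL/dlam = - g_val^T H^{-1} (d^2 L_tr / dw dlam),
    with the inverse Hessian applied via the Neumann series."""
    p = X_tr.shape[1]
    H = X_tr.T @ X_tr + lam * np.eye(p)    # inner-problem Hessian
    w = np.linalg.solve(H, X_tr.T @ y_tr)  # exact inner solution w*(lam)
    g_val = X_val.T @ (X_val @ w - y_val)  # dL_val/dw at w*(lam)
    alpha = 1.0 / np.linalg.norm(H, 2)     # guarantees series convergence
    # cross-derivative d^2 L_tr / (dw dlam) equals w for the ridge penalty
    return -g_val @ neumann_inv_hvp(H, w, alpha, n_terms)
```

The Neumann series only needs Hessian-vector products, which is why this style of approximation scales to models where forming or inverting the Hessian explicitly is infeasible.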

On the Iteration Complexity of Hypergradient Computation

- Computer Science, Mathematics, ICML
- 2020

A unified analysis is presented which allows for the first time to quantitatively compare these methods, providing explicit bounds for their iteration complexity, and suggests a hierarchy in terms of computational efficiency among the above methods.

Bilevel Programming for Hyperparameter Optimization and Meta-Learning

- Computer Science, ICML
- 2018

We introduce a framework based on bilevel programming that unifies gradient-based hyperparameter optimization and meta-learning. We show that an approximate version of the bilevel problem can be…

Differentiating the Value Function by using Convex Duality

- Computer Science, AISTATS
- 2021

This work uses a well known result from convex duality theory to relax the conditions and to derive convergence rates of the derivative approximation for several classes of parametric optimization problems in Machine Learning.

A Bilevel Optimization Approach for Parameter Learning in Variational Models

- Computer Science, Mathematics, SIAM J. Imaging Sci.
- 2013

This work considers a class of image denoising models incorporating $\ell_p$-norm-based analysis priors using a fixed set of linear operators and devises semismooth Newton methods for solving the resulting nonsmooth bilevel optimization problems.

Gradient-Based Optimization of Hyperparameters

- Computer Science, Mathematics, Neural Computation
- 2000

This article presents a methodology to optimize several hyperparameters, based on the computation of the gradient of a model selection criterion with respect to the hyperparameters, involving second derivatives of the training criterion.

Approximation Methods for Bilevel Programming

- Computer Science, Mathematics
- 2018

An approximation algorithm is presented for solving a class of bilevel programming problems where the inner objective function is strongly convex, together with its finite-time convergence analysis under different convexity assumptions on the outer objective function.

Gradient-based Hyperparameter Optimization through Reversible Learning

- Computer Science, ICML
- 2015

This work computes exact gradients of cross-validation performance with respect to all hyperparameters by chaining derivatives backwards through the entire training procedure, which allows the authors to optimize thousands of hyperparameters, including step-size and momentum schedules, weight initialization distributions, richly parameterized regularization schemes, and neural network architectures.
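Chaining derivatives backwards through training can be illustrated on a ridge objective: run T gradient steps forward while storing the iterates, then backpropagate the validation gradient through each step to accumulate dL_val/d(lam). Maclaurin et al. avoid storing the trajectory by exactly reversing momentum SGD; this sketch simply stores it, and all names and parameters here are illustrative assumptions.

```python
import numpy as np

def hypergrad_unrolled(X_tr, y_tr, X_val, y_val, lam, eta, T=500):
    """dL_val/d(lam) by reverse-mode differentiation through T
    gradient-descent steps on the ridge training objective
    0.5*||X w - y||^2 + 0.5*lam*||w||^2."""
    p = X_tr.shape[1]
    A = X_tr.T @ X_tr
    b = X_tr.T @ y_tr
    # forward pass: record the whole trajectory of iterates
    ws = [np.zeros(p)]
    for _ in range(T):
        w = ws[-1]
        ws.append(w - eta * (A @ w - b + lam * w))
    # reverse pass: g holds dL_val/dw_t, walking t from T down to 1
    g = X_val.T @ (X_val @ ws[-1] - y_val)
    dlam = 0.0
    for w in reversed(ws[:-1]):
        dlam += -eta * (g @ w)            # direct lam-dependence of the step
        g = g - eta * (A @ g + lam * g)   # pull g back through the step map
    return dlam
```

Storing the trajectory costs O(T·p) memory, which is exactly the bottleneck that motivates the reversible-learning trick of reconstructing iterates during the backward pass instead of storing them.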