Corpus ID: 233715000

Implicit differentiation for fast hyperparameter selection in non-smooth convex learning

@article{Bertrand2021ImplicitDF,
  title={Implicit differentiation for fast hyperparameter selection in non-smooth convex learning},
  author={Quentin Bertrand and Quentin Klopfenstein and Mathurin Massias and Mathieu Blondel and Samuel Vaiter and Alexandre Gramfort and Joseph Salmon},
  journal={ArXiv},
  year={2021},
  volume={abs/2105.01637}
}
Finding the optimal hyperparameters of a model can be cast as a bilevel optimization problem, typically solved using zero-order techniques. In this work we study first-order methods when the inner optimization problem is convex but non-smooth. We show that the forward-mode differentiation of proximal gradient descent and proximal coordinate descent yields sequences of Jacobians converging toward the exact Jacobian. Using implicit differentiation, we show it is possible to leverage the non… 
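To make the forward-mode idea concrete, here is a minimal NumPy sketch of proximal gradient descent (ISTA) for the Lasso that propagates the Jacobian of the iterates with respect to the regularization parameter alongside the iterates themselves, and then chains it with a held-out quadratic loss to form a hypergradient. This is an illustrative sketch under simple assumptions (least-squares data fit, a single hyperparameter), not the paper's implementation; the function name forward_diff_lasso and the toy data are made up.

# Forward-mode differentiation of ISTA for the Lasso: propagate d beta / d lam
# along with beta, then chain with a validation loss to get a hypergradient.
import numpy as np


def soft_threshold(z, t):
    """Proximal operator of t * ||.||_1."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)


def forward_diff_lasso(X, y, lam, n_iter=500):
    """Return the Lasso solution beta and d beta / d lam via forward-mode."""
    n, p = X.shape
    L = np.linalg.norm(X, 2) ** 2       # Lipschitz constant of the data-fit gradient
    gamma = 1.0 / L                     # step size
    beta = np.zeros(p)
    jac = np.zeros(p)                   # d beta / d lam, propagated alongside beta
    for _ in range(n_iter):
        z = beta - gamma * X.T @ (X @ beta - y)
        dz = jac - gamma * X.T @ (X @ jac)      # chain rule through the gradient step
        support = np.abs(z) > gamma * lam       # where the soft-threshold is differentiable
        beta = soft_threshold(z, gamma * lam)
        # d/dlam soft_threshold(z, gamma * lam) = support * dz - gamma * sign(z) * support
        jac = support * dz - gamma * np.sign(z) * support
    return beta, jac


# Hypergradient of a held-out quadratic loss via the chain rule:
# dC/dlam = jac^T @ grad_beta C(beta), with C(beta) = 0.5 * ||X_val beta - y_val||^2.
rng = np.random.default_rng(0)
X, y = rng.standard_normal((50, 30)), rng.standard_normal(50)
X_val, y_val = rng.standard_normal((20, 30)), rng.standard_normal(20)
beta, jac = forward_diff_lasso(X, y, lam=1.0)
hypergrad = jac @ (X_val.T @ (X_val @ beta - y_val))
print(hypergrad)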
Value Function Based Difference-of-Convex Algorithm for Bilevel Hyperparameter Selection Problems
TLDR
This work develops a sequentially convergent Value Function based Difference-of-Convex Algorithm with inexactness (VF-iDCA) and shows that this algorithm achieves stationary solutions without LLSC and LLS assumptions for bilevel programs from a broad class of hyperparameter tuning applications.
Nonsmooth Implicit Differentiation for Machine Learning and Optimization
TLDR
A nonsmooth implicit function theorem with an operational calculus is established, and several applications are provided, such as training deep equilibrium networks, training neural nets with conic optimization layers, or hyperparameter tuning for nonsmooth Lasso-type models.
Bilevel Optimization with a Lower-level Contraction: Optimal Sample Complexity without Warm-Start
TLDR
A simple method is proposed that uses stochastic fixed-point iterations at the lower level and projected inexact gradient descent at the upper level, reaching an ε-stationary point with O(ε^{-2}) samples in the stochastic setting and Õ(ε^{-1}) in the deterministic setting.
Efficient and Modular Implicit Differentiation
TLDR
This paper proposes automatic implicit differentiation, an efficient and modular approach for implicit differentiation of optimization problems, and shows the ease of formulating and solving bilevel optimization problems using the framework.
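As a generic illustration of the implicit-function-theorem recipe that such modular frameworks automate, the sketch below differentiates ridge regression with respect to its regularization parameter by solving the linear system defined by the stationarity conditions. It is a hand-rolled NumPy example, not the API of any particular library; the function name and the finite-difference check are illustrative.

# Implicit differentiation through the stationarity conditions of ridge regression:
#   beta(lam) = argmin_b 0.5 * ||X b - y||^2 + 0.5 * lam * ||b||^2
# The optimality condition G(beta, lam) = X^T (X beta - y) + lam * beta = 0 gives,
# by the implicit function theorem,
#   d beta / d lam = -(dG/dbeta)^{-1} (dG/dlam) = -(X^T X + lam I)^{-1} beta.
import numpy as np


def ridge_solution_and_jacobian(X, y, lam):
    p = X.shape[1]
    H = X.T @ X + lam * np.eye(p)       # dG/dbeta, the regularized Hessian
    beta = np.linalg.solve(H, X.T @ y)  # inner solution
    jac = -np.linalg.solve(H, beta)     # dG/dlam = beta, so J = -H^{-1} beta
    return beta, jac


rng = np.random.default_rng(0)
X, y = rng.standard_normal((50, 30)), rng.standard_normal(50)
beta, jac = ridge_solution_and_jacobian(X, y, lam=0.1)
# Finite-difference sanity check of the Jacobian.
eps = 1e-6
beta_eps, _ = ridge_solution_and_jacobian(X, y, lam=0.1 + eps)
print(np.max(np.abs((beta_eps - beta) / eps - jac)))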
Electromagnetic neural source imaging under sparsity constraints with SURE-based hyperparameter tuning
TLDR
This paper proposes to use a proxy of Stein’s Unbiased Risk Estimator (SURE) to automatically select the regularization parameters, and shows that the proposed SURE approach outperforms cross-validation strategies and state-of-the-art Bayesian statistics methods both computationally and statistically.
PUDLE: Implicit Acceleration of Dictionary Learning by Backpropagation
TLDR
A theoretical proof for these empirical results is offered through PUDLE, a Provable Unfolded Dictionary LEarning method, providing sufficient conditions on the network initialization and data distribution for model recovery, and highlighting the interpretability of PUDLE by deriving a mathematical relation between the network weights, its output, and the training data.
Stable and Interpretable Unrolled Dictionary Learning
TLDR
PUDLE’s interpretability, a driving factor in designing deep networks based on iterative optimization, is demonstrated by building a mathematical relation between the network weights, its output, and the training set.

References

SHOWING 1-10 OF 127 REFERENCES
Convergence Properties of Stochastic Hypergradients
TLDR
This work provides iteration complexity bounds for the mean square error of the hypergradient approximation, under the assumption that the lower-level problem is accessible only through a stochastic mapping which is a contraction in expectation.
Hyperparameter optimization with approximate gradient
TLDR
This work proposes an algorithm for the optimization of continuous hyperparameters using inexact gradient information and gives sufficient conditions for the global convergence of this method, based on regularity conditions of the involved functions and summability of errors.
Optimizing Millions of Hyperparameters by Implicit Differentiation
TLDR
An algorithm for inexpensive gradient-based hyperparameter optimization that combines the implicit function theorem (IFT) with efficient inverse Hessian approximations is proposed and used to train modern network architectures with millions of weights and millions of hyper-parameters.
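A key numerical ingredient in such IFT-based hypergradients is an approximate inverse-Hessian-vector product. The sketch below shows a truncated Neumann-series approximation on a toy explicit Hessian; in practice the product H @ v would be computed with automatic differentiation rather than a stored matrix, and the function name, step size, and toy problem are illustrative rather than taken from the paper.

# Truncated Neumann series for an inverse Hessian-vector product:
#   H^{-1} v  ~=  alpha * sum_{i=0}^{K} (I - alpha H)^i v,
# which converges when alpha * ||H|| <= 1 and H is positive definite.
import numpy as np


def neumann_inverse_hvp(hvp, v, alpha, n_terms=100):
    """Approximate H^{-1} v given a Hessian-vector product callable hvp."""
    term = v.copy()     # current term (I - alpha H)^i v
    acc = v.copy()      # running sum of the series
    for _ in range(n_terms):
        term = term - alpha * hvp(term)
        acc = acc + term
    return alpha * acc


# Toy check against an explicit solve.
rng = np.random.default_rng(0)
A = rng.standard_normal((20, 20))
H = A @ A.T + 20 * np.eye(20)           # well-conditioned positive definite Hessian
v = rng.standard_normal(20)
approx = neumann_inverse_hvp(lambda u: H @ u, v,
                             alpha=1.0 / np.linalg.norm(H, 2), n_terms=500)
print(np.max(np.abs(approx - np.linalg.solve(H, v))))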
On the Iteration Complexity of Hypergradient Computation
TLDR
A unified analysis is presented which allows for the first time to quantitatively compare these methods, providing explicit bounds for their iteration complexity, and suggests a hierarchy in terms of computational efficiency among the above methods.
Bilevel Programming for Hyperparameter Optimization and Meta-Learning
We introduce a framework based on bilevel programming that unifies gradient-based hyperparameter optimization and meta-learning. We show that an approximate version of the bilevel problem can be…
Differentiating the Value Function by using Convex Duality
TLDR
This work uses a well known result from convex duality theory to relax the conditions and to derive convergence rates of the derivative approximation for several classes of parametric optimization problems in Machine Learning.
A Bilevel Optimization Approach for Parameter Learning in Variational Models
TLDR
This work considers a class of image denoising models incorporating $\ell_p$-norm-based analysis priors using a fixed set of linear operators, and devises semismooth Newton methods for solving the resulting nonsmooth bilevel optimization problems.
Gradient-Based Optimization of Hyperparameters
TLDR
This article presents a methodology to optimize several hyper-parameters, based on the computation of the gradient of a model selection criterion with respect to the hyper-parameters, a gradient that involves second derivatives of the training criterion.
Approximation Methods for Bilevel Programming
TLDR
An approximation algorithm is presented for solving a class of bilevel programming problems in which the inner objective function is strongly convex, together with a finite-time convergence analysis under different convexity assumptions on the outer objective function.
Gradient-based Hyperparameter Optimization through Reversible Learning
TLDR
This work computes exact gradients of cross-validation performance with respect to all hyperparameters by chaining derivatives backwards through the entire training procedure, which allows us to optimize thousands ofhyperparameters, including step-size and momentum schedules, weight initialization distributions, richly parameterized regularization schemes, and neural network architectures.