• Corpus ID: 6281930

Scalable Gradient-Based Tuning of Continuous Regularization Hyperparameters

@inproceedings{Luketina2016ScalableGT,
  title={Scalable Gradient-Based Tuning of Continuous Regularization Hyperparameters},
  author={Jelena Luketina and Tapani Raiko and Mathias Berglund and Klaus Greff},
  booktitle={ICML},
  year={2016}
}
Hyperparameter selection generally relies on running multiple full training trials, with selection based on validation set performance. We propose a gradient-based approach for locally adjusting hyperparameters during training of the model. Hyperparameters are adjusted so as to make the model parameter gradients, and hence updates, more advantageous for the validation cost. We explore the approach for tuning regularization hyperparameters and find that in experiments on MNIST, SVHN and CIFAR-10… 
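
To make the mechanism concrete, here is a minimal sketch of the general idea under simplifying assumptions: plain NumPy, ridge-regularized linear regression, a single global L2 coefficient, and vanilla gradient descent. After every parameter update, the hyperparameter is nudged in the direction that makes the update just taken more helpful for the validation cost. All names (eta, beta, lam) are illustrative; this is a sketch of gradient-based hyperparameter adjustment during training, not the paper's exact algorithm.

```python
import numpy as np

# Synthetic regression data; the validation set drives the hyperparameter updates.
rng = np.random.default_rng(0)
n, d = 200, 20
X_tr, X_val = rng.normal(size=(n, d)), rng.normal(size=(n, d))
w_true = rng.normal(size=d)
y_tr = X_tr @ w_true + 0.5 * rng.normal(size=n)
y_val = X_val @ w_true + 0.5 * rng.normal(size=n)

theta = np.zeros(d)
lam = 0.1                # L2 regularization strength, tuned online
eta, beta = 1e-2, 0.1    # parameter and hyperparameter learning rates (illustrative)

def grad_train(th, lam):
    # gradient of (mean squared error + lam * ||theta||^2) on the training set
    return 2 * X_tr.T @ (X_tr @ th - y_tr) / n + 2 * lam * th

def grad_val(th):
    # gradient of the mean squared error on the validation set (no penalty)
    return 2 * X_val.T @ (X_val @ th - y_val) / n

for step in range(500):
    g_omega = 2 * theta                    # gradient of the penalty Omega(theta) = ||theta||^2
    theta_new = theta - eta * grad_train(theta, lam)
    # Chain rule through this single update:
    #   dC_val(theta_new)/dlam = grad C_val(theta_new) . dtheta_new/dlam
    #                          = -eta * grad C_val(theta_new) . g_omega
    hypergrad = -eta * grad_val(theta_new) @ g_omega
    lam = max(lam - beta * hypergrad, 0.0)  # gradient step on lam, kept non-negative
    theta = theta_new

print("tuned L2 coefficient:", lam)
```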

Citations

Experiments With Scalable Gradient-based Hyperparameter Optimization for Deep Neural Networks
TLDR
Some candidate completions of DrMAD, a gradient-based algorithm that updates the hyperparameters after fully training the model parameters, are explored, with experiments tuning per-parameter L2 regularization coefficients on CIFAR-10 with the DenseNet architecture.
Fast Efficient Hyperparameter Tuning for Policy Gradients
TLDR
This paper proposes Hyperparameter Optimisation on the Fly (HOOF), a gradient-free algorithm that requires no more than one training run to automatically adapt the hyperparameters that affect the policy update directly through the gradient.
Efficient Hyperparameter Tuning with Dynamic Accuracy Derivative-Free Optimization
TLDR
This work applies a recent dynamic accuracy derivative-free optimization method to hyperparameter tuning, which allows inexact evaluations of the learning problem while retaining convergence guarantees, and shows its robustness and efficiency compared to a fixed accuracy approach.
Self-Tuning Networks: Bilevel Optimization of Hyperparameters using Structured Best-Response Functions
TLDR
This work aims to adapt regularization hyperparameters for neural networks by fitting compact approximations to the best-response function, which maps hyperparameters to optimal weights and biases, and outperforms competing hyperparameter optimization methods on large-scale deep learning problems.
Scalable One-Pass Optimisation of High-Dimensional Weight-Update Hyperparameters by Implicit Differentiation
TLDR
This work extends existing methods to develop an approximate hypergradient-based hyperparameter optimiser which is applicable to any continuous hyperparameter appearing in a differentiable model weight update, yet requires only one training episode, with no restarts.
Online Hyperparameter Meta-Learning with Hypergradient Distillation
TLDR
This work parameterizes a single Jacobian-vector product for each HO step and minimizes the distance from the true second-order term with knowledge distillation, which allows online optimization and is also scalable to the hyperparameter dimension and the horizon length.
Comprehensive analysis of gradient-based hyperparameter optimization algorithms
TLDR
The experiments show that models optimized using the evidence lower bound give a higher error rate than models obtained using cross-validation, but also show greater stability when the data is noisy.
Gradient Descent: The Ultimate Optimizer
TLDR
This work proposes to instead learn the hyperparameters themselves by gradient descent, and furthermore to learn the hyper-hyperparameters by gradient descent as well, and so on ad infinitum.
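
As a concrete illustration of one level of that recursion, the sketch below tunes the step size of plain gradient descent on a least-squares problem, using the exact derivative of the post-update loss with respect to the step size. The names (alpha, beta) and the problem are illustrative; this is a generic hypergradient sketch, not the paper's differentiate-through-the-optimizer implementation.

```python
import numpy as np

# Least-squares objective ||A theta - b||^2 / m on a fixed random problem instance.
rng = np.random.default_rng(1)
A = rng.normal(size=(50, 10))
b = rng.normal(size=50)

def loss(th):
    r = A @ th - b
    return r @ r / len(b)

def grad(th):
    return 2 * A.T @ (A @ th - b) / len(b)

theta = np.zeros(10)
alpha = 1e-3   # learning rate, itself learned below
beta = 1e-4    # step size for updating alpha (kept fixed in this one-level sketch)

g_prev = grad(theta)
for step in range(300):
    theta = theta - alpha * g_prev       # ordinary gradient step with the current alpha
    g = grad(theta)
    # theta_t = theta_{t-1} - alpha * g_{t-1}  =>  dL(theta_t)/dalpha = -g . g_prev,
    # so a gradient-descent step on alpha adds beta * (g . g_prev).
    alpha += beta * (g @ g_prev)
    g_prev = g

print("final loss:", loss(theta), "learned step size:", alpha)
```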

References

Showing 1-10 of 31 references
Gradient-Based Optimization of Hyperparameters
  • Yoshua Bengio
  • Computer Science, Mathematics
    Neural Computation
  • 2000
TLDR
This article presents a methodology to optimize several hyperparameters, based on the computation of the gradient of a model selection criterion with respect to the hyperparameters, where this hyperparameter gradient involves second derivatives of the training criterion (a worked form is sketched after the reference list).
Gradient-based Hyperparameter Optimization through Reversible Learning
TLDR
This work computes exact gradients of cross-validation performance with respect to all hyperparameters by chaining derivatives backwards through the entire training procedure, which allows us to optimize thousands of hyperparameters, including step-size and momentum schedules, weight initialization distributions, richly parameterized regularization schemes, and neural network architectures.
Efficient multiple hyperparameter learning for log-linear models
TLDR
This paper derives an efficient gradient-based method for learning Gaussian regularization priors with multiple hyperparameters for log-linear models, a class of structured prediction probabilistic models which includes conditional random fields (CRFs).
Optimal use of regularization and cross-validation in neural network modeling
  • Dingding Chen, M. Hagan
  • Computer Science
    IJCNN'99. International Joint Conference on Neural Networks. Proceedings (Cat. No.99CH36339)
  • 1999
TLDR
The results demonstrate that the SDVR framework is very promising for adaptive regularization and can be cost effectively applied to a variety of different problems.
Adaptive Regularization in Neural Network Modeling
TLDR
The idea is to minimize an empirical estimate of the generalization error with respect to regularization parameters by employing a simple iterative gradient descent scheme using virtually no additional programming overhead compared to standard training.
Practical Bayesian Optimization of Machine Learning Algorithms
TLDR
This work describes new algorithms that take into account the variable cost of learning-algorithm experiments and can leverage multiple cores for parallel experimentation, and shows that the proposed algorithms improve on previous automatic procedures and can reach or surpass human expert-level optimization for many algorithms.
Adam: A Method for Stochastic Optimization
TLDR
This work introduces Adam, an algorithm for first-order gradient-based optimization of stochastic objective functions, based on adaptive estimates of lower-order moments, and provides a regret bound on the convergence rate that is comparable to the best known results under the online convex optimization framework.
Algorithms for Hyper-Parameter Optimization
TLDR
This work contributes novel techniques for making response surface models P(y|x) in which many elements of hyper-parameter assignment (x) are known to be irrelevant given particular values of other elements.
Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift
TLDR
Applied to a state-of-the-art image classification model, Batch Normalization achieves the same accuracy with 14 times fewer training steps, and beats the original model by a significant margin.
Fast dropout training
TLDR
This work shows how to do fast dropout training by sampling from or integrating a Gaussian approximation, instead of doing Monte Carlo optimization of this objective, which gives an order of magnitude speedup and more stability.
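
For the Bengio (2000) entry above, the "second derivatives of the training criterion" enter through the implicit-differentiation form of the hypergradient. The display below is a sketch in generic notation (E_val for the model selection criterion, C for the training criterion, theta for the parameters, lambda for a hyperparameter), not a transcription of the article's own derivation.

```latex
% Hypergradient when \theta^*(\lambda) minimizes the training criterion C(\theta,\lambda):
\[
  \frac{\mathrm{d} E_{\mathrm{val}}}{\mathrm{d}\lambda}
    = \nabla_{\theta} E_{\mathrm{val}}\bigl(\theta^{*}(\lambda)\bigr)^{\top}
      \frac{\mathrm{d}\theta^{*}}{\mathrm{d}\lambda},
  \qquad
  \frac{\mathrm{d}\theta^{*}}{\mathrm{d}\lambda}
    = -\bigl(\nabla_{\theta}^{2} C(\theta^{*},\lambda)\bigr)^{-1}
      \nabla_{\lambda}\nabla_{\theta} C(\theta^{*},\lambda).
\]
% The Hessian and mixed second derivatives of C are the "second derivatives of the
% training criterion" referred to in the TLDR.
```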