Corpus ID: 11525196

Speed learning on the fly

Pierre Massé, Yann Ollivier
The practical performance of online stochastic gradient descent algorithms is highly dependent on the chosen step size, which must be tediously hand-tuned in many applications. The same is true for more advanced variants of stochastic gradients, such as SAGA, SVRG, or AdaGrad. Here we propose to adapt the step size by performing a gradient descent on the step size itself, viewing the whole performance of the learning trajectory as a function of step size. Importantly, this adaptation can be…
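The idea stated in the abstract, gradient descent on the step size itself, can be illustrated with a small sketch. This is not the authors' algorithm as published; it is a minimal illustration under stated assumptions: a synthetic least-squares problem with unit-norm inputs, a running vector `h` that tracks the derivative of the parameters with respect to the log step size as if the step size had been constant, an RMS normalization of the step-size gradient, and an upper clamp `eta_max`. The function names, the normalization, the clamp, and all constants are hypothetical choices made for this example.

```python
import numpy as np


def make_problem(n=200, dim=5, noise=0.01, seed=0):
    """Synthetic least-squares data with unit-norm inputs (illustrative only)."""
    rng = np.random.default_rng(seed)
    X = rng.normal(size=(n, dim))
    X /= np.linalg.norm(X, axis=1, keepdims=True)
    theta_star = rng.normal(size=dim)
    y = X @ theta_star + noise * rng.normal(size=n)
    return X, y


def mean_loss(theta, X, y):
    """Average squared-error loss over the whole data set."""
    return 0.5 * np.mean((X @ theta - y) ** 2)


def sgd_with_adaptive_step(X, y, eta0=1e-4, mu=0.02, eta_max=1.0,
                           steps=2000, seed=1):
    """SGD whose step size eta is itself updated by gradient descent.

    Assumption of this sketch: h approximates d(theta)/d(log eta) along
    the trajectory, treating eta as if it had been constant so far.
    """
    rng = np.random.default_rng(seed)
    n, dim = X.shape
    theta = np.zeros(dim)
    h = np.zeros(dim)
    log_eta = np.log(eta0)
    v = 0.0  # running mean of squared step-size gradients (assumed normalizer)
    eta_hist = []
    for _ in range(steps):
        i = rng.integers(n)
        x, target = X[i], y[i]
        g = (x @ theta - target) * x  # per-sample gradient at current theta
        # Chain rule: d(loss)/d(log eta) = g . d(theta)/d(log eta)
        d = g @ h
        v = 0.9 * v + 0.1 * d * d
        log_eta -= mu * d / (np.sqrt(v) + 1e-12)
        log_eta = min(log_eta, np.log(eta_max))  # safeguard, not from the source
        eta = np.exp(log_eta)
        # Differentiate theta - eta * g with respect to log eta;
        # for this loss the per-sample Hessian is x x^T.
        h = h - eta * g - eta * (x @ h) * x
        theta = theta - eta * g
        eta_hist.append(eta)
    return theta, np.array(eta_hist)
```

Starting from a deliberately small `eta0`, one would expect the step size to grow while larger steps still reduce the loss, and to stop growing once they no longer help. Calling `sgd_with_adaptive_step(*make_problem())` and comparing `mean_loss` before and after is a quick way to check this behavior on the toy problem.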
Learning with Random Learning Rates
Alrao could save time when testing deep learning models: a range of models could be quickly assessed with Alrao, and the most promising models could then be trained more extensively.
Autour de L'Usage des gradients en apprentissage statistique. (Around the Use of Gradients in Machine Learning)
A local convergence theorem is proved for the classical dynamical system optimisation algorithm called RTRL, in a nonlinear setting, and the LLR algorithm is formalised by replacing this information with an unbiased, low-dimensional, random approximation.
2 All Learning Rates At Once : Description
Hyperparameter tuning is a bothersome step in the training of deep learning models. One of the most sensitive hyperparameters is the learning rate of the gradient descent. We present the All Learning…
Barzilai-Borwein Step Size for Stochastic Gradient Descent
The Barzilai-Borwein (BB) method is proposed for automatically computing step sizes for SGD and its variant, the stochastic variance reduced gradient (SVRG) method. This leads to two algorithms, SGD-BB and SVRG-BB, and the approach is described as superior to some advanced SGD variants.
Nesterov's accelerated gradient and momentum as approximations to regularised update descent
It is shown that a new algorithm, termed Regularised Gradient Descent, can converge more quickly than either Nesterov's algorithm or the classical momentum algorithm.
Variance-based stochastic extragradient methods with line search for stochastic variational inequalities
A dynamic sampled stochastic approximated (DS-SA) extragradient method for stochastic variational inequalities (SVI) is proposed that is robust with respect to an unknown Lipschitz constant…
Data-driven algorithm selection and tuning in optimization and signal processing
The goal is to train machine learning methods to automatically improve the performance of optimization and signal processing algorithms. The work shows that there exists a learning algorithm that, with high probability, will select the algorithm that optimizes the average performance on an input set of problem instances with a given distribution.
Data-driven Algorithm Selection and Parameter Tuning: Two Case studies in Optimization and Signal Processing
The goal is to train machine learning methods to automatically improve the performance of optimization and signal processing algorithms. This approach is used to improve two popular data processing subroutines in data science: stochastic gradient descent and greedy methods in compressed sensing.
Autour De L'Usage des gradients en apprentissage statistique
We establish a local convergence theorem for the classical dynamical system optimisation algorithm RTRL, applied to a nonlinear system. The RTRL algorithm is an algorithm…


No more pesky learning rates
The proposed method to automatically adjust multiple learning rates so as to minimize the expected error at any one time relies on local gradient variations across samples, making it suitable for non-stationary problems.
Large-Scale Machine Learning with Stochastic Gradient Descent
A more precise analysis uncovers qualitatively different tradeoffs for the case of small-scale and large-scale learning problems.
Gradient-based Hyperparameter Optimization through Reversible Learning
This work computes exact gradients of cross-validation performance with respect to all hyperparameters by chaining derivatives backwards through the entire training procedure, which allows us to optimize thousands of hyperparameters, including step-size and momentum schedules, weight initialization distributions, richly parameterized regularization schemes, and neural network architectures.
Adaptive Subgradient Methods for Online Learning and Stochastic Optimization
This work describes and analyzes an apparatus for adaptively modifying the proximal function, which significantly simplifies setting a learning rate and results in regret guarantees that are provably as good as the best proximal functions that can be chosen in hindsight.
Accelerating Stochastic Gradient Descent using Predictive Variance Reduction
It is proved that this method enjoys the same fast convergence rate as those of stochastic dual coordinate ascent (SDCA) and Stochastic Average Gradient (SAG), but the analysis is significantly simpler and more intuitive.
Natural Gradient Works Efficiently in Learning
  • S. Amari
  • Computer Science, Mathematics
  • Neural Computation
  • 1998
The dynamical behavior of natural gradient online learning is analyzed and is proved to be Fisher efficient, implying that it has asymptotically the same performance as the optimal batch estimation of parameters.
Minimizing finite sums with the stochastic average gradient
Numerical experiments indicate that the new SAG method often dramatically outperforms existing SG and deterministic gradient methods, and that the performance may be further improved through the use of non-uniform sampling strategies.
SAGA: A Fast Incremental Gradient Method With Support for Non-Strongly Convex Composite Objectives
This work introduces a new optimisation method called SAGA, which improves on the theory behind SAG and SVRG, with better theoretical convergence rates, and has support for composite objectives where a proximal operator is used on the regulariser.
A Stochastic Approximation Method
Let M(x) denote the expected value at level x of the response to a certain experiment. M(x) is assumed to be a monotone function of x but is unknown to the experimenter, and it is desired to find the…
Stochastic Approximation Methods for Constrained and Unconstrained Systems, volume 26 of Applied Mathematical Sciences
  • 1978