Corpus ID: 16925610

Learning Gradient Descent: Better Generalization and Longer Horizons

@inproceedings{Lv2017LearningGD,
  title={Learning Gradient Descent: Better Generalization and Longer Horizons},
  author={Kaifeng Lv and Shunhua Jiang and J. Li},
  booktitle={ICML},
  year={2017}
}
Training deep neural networks is a highly nontrivial task, involving carefully selecting appropriate training algorithms, scheduling step sizes and tuning other hyperparameters. Trying different combinations can be quite labor-intensive and time-consuming. Recently, researchers have tried to use deep learning algorithms to exploit the landscape of the loss function of the training problem of interest, and learn how to optimize over it in an automatic way. In this paper, we propose a new…
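For context, the learned-optimizer framework that this abstract refers to (and that the papers listed below build on) replaces a hand-designed update rule with a recurrent model m having its own parameters φ. The formulation below is the standard one from the "learning to learn by gradient descent" line of work, written in my notation rather than taken from this paper:

    \theta_{t+1} = \theta_t + g_t, \qquad
    \begin{bmatrix} g_t \\ h_{t+1} \end{bmatrix}
      = m\!\left(\nabla_{\theta} f(\theta_t),\, h_t;\, \phi\right)

Here f is the optimizee loss, h_t is the optimizer's hidden state, and φ is meta-trained by minimizing the loss accumulated along an unrolled trajectory, e.g. L(φ) = Σ_t f(θ_t).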
Citations

Meta-Learner with Sparsified Backpropagation
An application of meProp to learning-to-learn models is proposed, focusing learning on the most significant, consciously chosen parameters, and an improvement in accuracy with the proposed technique is demonstrated.
Neural Optimizers with Hypergradients for Tuning Parameter-Wise Learning Rates
Recent studies show that LSTM-based neural optimizers are competitive with state-of-the-art hand-designed optimization methods for short horizons. Existing neural optimizers learn how to update the…
Tasks, stability, architecture, and compute: Training more effective learned optimizers, and using them to train themselves
This work introduces a new, neural-network-parameterized, hierarchical optimizer with access to additional features such as validation loss to enable automatic regularization, and shows evidence that it is useful for out-of-distribution tasks such as training learned optimizers from scratch.
Training more effective learned optimizers
Much as replacing hand-designed features with learned functions has revolutionized how we solve perceptual tasks, we believe learned algorithms will transform how we train models. In this work we…
HyperAdam: A Learnable Task-Adaptive Adam for Network Training
A new optimizer, dubbed HyperAdam, is proposed that combines the idea of "learning to optimize" with the traditional Adam optimizer, and is shown to be state-of-the-art for training various networks such as multilayer perceptrons, CNNs and LSTMs.
Overcoming Barriers to the Training of Effective Learned Optimizers
  • 2020
In this work we focus on general-purpose learned optimizers capable of training a wide variety of problems with no user-specified hyperparameters. We introduce a new, neural-network-parameterized,…
Using a thousand optimization tasks to learn hyperparameter search strategies
The diversity of the TaskSet and the method for learning hyperparameter lists are used to empirically explore the generalization of these lists to new optimization tasks in a variety of settings, including ImageNet classification with ResNet-50 and LM1B language modeling with Transformers.
Training Stronger Baselines for Learning to Optimize
This work presents a progressive training scheme that gradually increases the optimizer unroll length, to mitigate the well-known L2O dilemma of truncation bias (shorter unrolling) versus gradient explosion (longer unrolling), and uses off-policy imitation learning to guide L2O training by referencing the behavior of analytical optimizers.
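As a rough illustration of the progressive-unrolling idea described in this entry, the sketch below grows the unroll horizon as meta-training proceeds; the constants and schedule shape are my own illustrative choices, not the authors' scheme:

    # Illustrative sketch only: lengthen the unroll horizon over meta-training,
    # trading truncation bias (short unrolls) against exploding meta-gradients
    # (long unrolls).
    def unroll_schedule(meta_step, start=5, max_len=100, grow_every=200):
        """Double the unroll length every `grow_every` meta-steps, capped at `max_len`."""
        return min(max_len, start * 2 ** (meta_step // grow_every))

    for meta_step in range(1000):
        horizon = unroll_schedule(meta_step)   # e.g. 5, 10, 20, ... up to 100
        # meta-train the learned optimizer on a trajectory of `horizon` steps here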
Understanding and correcting pathologies in the training of learned optimizers
This work proposes a training scheme that overcomes both of these difficulties by dynamically weighting two unbiased gradient estimators for a variational loss on optimizer performance, allowing neural networks to be trained to perform optimization of a specific task faster than tuned first-order methods.
Meta-LR-Schedule-Net: Learned LR Schedules that Scale and Generalize
This work designs a meta-learner with an explicit mapping formulation to parameterize LR schedules, which can adjust the LR adaptively to comply with the current training dynamics by leveraging information from past training histories.

References

Showing 1–10 of 33 references.
Learning to Learn without Gradient Descent by Gradient Descent
It is shown that recurrent neural network optimizers trained on simple synthetic functions by gradient descent exhibit a remarkable degree of transfer: they can be used to efficiently optimize a broad range of derivative-free black-box functions, including Gaussian process bandits, simple control objectives, global optimization benchmarks and hyperparameter tuning tasks.
Learning to learn by gradient descent by gradient descent
This paper shows how the design of an optimization algorithm can be cast as a learning problem, allowing the algorithm to learn to exploit structure in the problems of interest in an automatic way.
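A minimal toy sketch of the coordinatewise recurrent optimizer this paper describes, assuming PyTorch; the class name, the 20-unit hidden size, and the random quadratic optimizee are my own illustrative choices, not the authors' implementation:

    import torch
    import torch.nn as nn

    class LSTMOptimizer(nn.Module):
        """Coordinatewise learned optimizer: the same small LSTM is applied to every coordinate."""
        def __init__(self, hidden_size=20):
            super().__init__()
            self.lstm = nn.LSTMCell(1, hidden_size)
            self.out = nn.Linear(hidden_size, 1)

        def forward(self, grad, state):
            # grad: (num_params, 1) -- one "batch" element per coordinate
            h, c = self.lstm(grad, state)
            return self.out(h), (h, c)   # proposed additive update per coordinate

    def meta_loss(opt_net, horizon=20, dim=10, hidden_size=20):
        """Unroll the learned optimizer on a random quadratic and sum the losses."""
        A = torch.randn(dim, dim)
        theta = torch.randn(dim, 1, requires_grad=True)
        state = (torch.zeros(dim, hidden_size), torch.zeros(dim, hidden_size))
        total = 0.0
        for _ in range(horizon):
            loss = ((A @ theta) ** 2).sum()
            grad, = torch.autograd.grad(loss, theta, create_graph=True)
            update, state = opt_net(grad, state)
            theta = theta + update       # the learned update rule
            total = total + loss
        return total

    opt_net = LSTMOptimizer()
    meta_opt = torch.optim.Adam(opt_net.parameters(), lr=1e-3)
    for _ in range(5):                   # a few meta-training steps for illustration
        meta_opt.zero_grad()
        meta_loss(opt_net).backward()    # backprop through the unrolled trajectory
        meta_opt.step()

The meta-gradient is obtained by differentiating the summed optimizee loss through the entire unrolled trajectory, which is why `create_graph=True` is needed when taking the inner gradients.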
Learning to Learn for Global Optimization of Black Box Functions
This work uses a large set of smooth target functions to learn a recurrent neural network (RNN) optimizer, which is either a long short-term memory network or a differentiable neural computer.
Learning to Learn Using Gradient Descent
This paper makes meta-learning in large systems feasible by using recurrent neural networks with attendant learning routines as meta-learning systems, and evaluates the approach against gradient descent methods on nonstationary time-series prediction.
Using Deep Q-Learning to Control Optimization Hyperparameters
  • S. Hansen
  • Mathematics, Computer Science
  • ArXiv
  • 2016
A novel definition of the reinforcement learning state, actions and reward function that allows a deep Q-network to learn to control an optimization hyperparameter is presented, and it is shown that the DQN's Q-values associated with the optimal action converge and that the Q-gradient descent algorithms outperform gradient descent with an Armijo or nonmonotone line search.
Learning to reinforcement learn
This work introduces a novel approach to deep meta-reinforcement learning, which is a system that is trained using one RL algorithm, but whose recurrent dynamics implement a second, quite separate RL procedure.
Adaptive Subgradient Methods for Online Learning and Stochastic Optimization
This work describes and analyzes an apparatus for adaptively modifying the proximal function, which significantly simplifies setting a learning rate and results in regret guarantees that are provably as good as the best proximal functions that can be chosen in hindsight.
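A minimal NumPy sketch of the diagonal AdaGrad-style update described above; the step size and epsilon are illustrative defaults, not values prescribed by the paper:

    import numpy as np

    def adagrad_step(theta, grad, accum, lr=0.01, eps=1e-8):
        """One AdaGrad step: per-coordinate step sizes shrink with the accumulated squared gradients."""
        accum = accum + grad ** 2                       # running sum of squared gradients
        theta = theta - lr * grad / (np.sqrt(accum) + eps)
        return theta, accum

    # Usage: carry `accum` (same shape as theta) across iterations, starting from zeros.
    theta, accum = np.zeros(3), np.zeros(3)
    theta, accum = adagrad_step(theta, np.array([0.1, -0.2, 0.3]), accum)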
Adam: A Method for Stochastic Optimization
This work introduces Adam, an algorithm for first-order gradient-based optimization of stochastic objective functions, based on adaptive estimates of lower-order moments, and provides a regret bound on the convergence rate that is comparable to the best known results under the online convex optimization framework.
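For reference, a minimal NumPy sketch of the Adam update rule; the constants are the commonly used defaults, and this is an illustration rather than the authors' code:

    import numpy as np

    def adam_step(theta, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
        """One Adam step; `t` is the 1-based iteration count used for bias correction."""
        m = beta1 * m + (1 - beta1) * grad          # first-moment (mean) estimate
        v = beta2 * v + (1 - beta2) * grad ** 2     # second-moment (uncentered variance) estimate
        m_hat = m / (1 - beta1 ** t)                # bias-corrected moments
        v_hat = v / (1 - beta2 ** t)
        theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
        return theta, m, v

Starting `t` at 1 matters: without the bias correction the moment estimates are biased toward zero during the first steps.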
Fast and Accurate Deep Network Learning by Exponential Linear Units (ELUs)
This work introduces the "exponential linear unit" (ELU), which speeds up learning in deep neural networks and leads to higher classification accuracies and significantly better generalization performance than ReLUs and LReLUs on networks with more than five layers.
ADADELTA: An Adaptive Learning Rate Method
We present a novel per-dimension learning rate method for gradient descent called ADADELTA. The method dynamically adapts over time using only first-order information and has minimal computational overhead beyond vanilla stochastic gradient descent.
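A minimal NumPy sketch of the ADADELTA update: running averages of squared gradients and squared updates yield a per-dimension step size with no global learning rate (the rho and eps values here are illustrative defaults):

    import numpy as np

    def adadelta_step(theta, grad, eg2, ex2, rho=0.95, eps=1e-6):
        """One ADADELTA step; `eg2` and `ex2` are the running averages E[g^2] and E[dx^2]."""
        eg2 = rho * eg2 + (1 - rho) * grad ** 2                   # accumulate E[g^2]
        delta = -np.sqrt(ex2 + eps) / np.sqrt(eg2 + eps) * grad   # RMS[dx] / RMS[g] * g
        ex2 = rho * ex2 + (1 - rho) * delta ** 2                  # accumulate E[dx^2]
        return theta + delta, eg2, ex2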