Corpus ID: 197935378

Lookahead Optimizer: k steps forward, 1 step back

@inproceedings{Zhang2019LookaheadOK,
  title={Lookahead Optimizer: k steps forward, 1 step back},
  author={Michael Ruogu Zhang and James Lucas and Geoffrey E. Hinton and Jimmy Ba},
  booktitle={NeurIPS},
  year={2019}
}
The vast majority of successful deep neural networks are trained using variants of stochastic gradient descent (SGD) algorithms. [...] Intuitively, the algorithm chooses a search direction by looking ahead at the sequence of "fast weights" generated by another optimizer. We show that Lookahead improves learning stability and lowers the variance of its inner optimizer with negligible computation and memory cost. We empirically demonstrate that Lookahead can significantly improve the performance [...]
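
To make the "k steps forward, 1 step back" update concrete, below is a minimal NumPy sketch of the Lookahead rule as summarized above: run k inner-optimizer steps on the fast weights, then move the slow weights a fraction alpha of the way toward the resulting fast weights and restart from there. Plain gradient descent stands in for the inner optimizer, and the function name lookahead_sgd, the toy quadratic loss, and the hyperparameter values are illustrative assumptions rather than the paper's exact setup.

import numpy as np

def lookahead_sgd(grad_fn, phi0, k=5, alpha=0.5, inner_lr=0.1, outer_steps=100):
    """Sketch of the Lookahead update with plain gradient descent as the
    inner optimizer: k fast-weight steps, then interpolate the slow weights
    toward the final fast weights."""
    phi = phi0.astype(float).copy()          # slow weights
    for _ in range(outer_steps):
        theta = phi.copy()                   # fast weights start at the slow weights
        for _ in range(k):                   # k steps forward (inner optimizer)
            theta -= inner_lr * grad_fn(theta)
        phi += alpha * (theta - phi)         # 1 step back: phi <- phi + alpha * (theta - phi)
    return phi

# Toy usage on the quadratic loss f(x) = 0.5 * ||x||^2, whose gradient is x.
if __name__ == "__main__":
    x0 = np.array([3.0, -2.0])
    x_final = lookahead_sgd(lambda x: x, x0)
    print(x_final)   # should be close to the minimizer at the origin
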
Citations

Slowing Down the Weight Norm Increase in Momentum-based Optimizers
Adaptive Multi-level Hyper-gradient Descent
Adaptive Learning Rate and Momentum for Training Deep Neural Networks
Taming GANs with Lookahead
Ranger21: a synergistic deep learning optimizer
Iterate Averaging Helps: An Alternative Perspective in Deep Learning
Adaptive Learning Rates with Maximum Variation Averaging
CProp: Adaptive Learning Rate Scaling from Past Gradient Conformity
Training Stronger Baselines for Learning to Optimize
