Corpus ID: 231924993

Learning by Turning: Neural Architecture Aware Optimisation

@article{Liu2021LearningBT,
  title={Learning by Turning: Neural Architecture Aware Optimisation},
  author={Y. Liu and J. Bernstein and M. Meister and Yisong Yue},
  journal={ArXiv},
  year={2021},
  volume={abs/2102.07227}
}
Descent methods for deep networks are notoriously capricious: they require careful tuning of step size, momentum and weight decay, and which method will work best on a new benchmark is a priori unclear. To address this problem, this paper conducts a combined study of neural architecture and optimisation, leading to a new optimiser called Nero: the neuronal rotator. Nero trains reliably without momentum or weight decay, works in situations where Adam and SGD fail, and requires little to no…
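The abstract describes Nero as a per-neuron ("neuronal rotator") optimiser that works without momentum or weight decay. Below is a minimal illustrative sketch of one plausible update of this kind, assuming PyTorch: each neuron's gradient is normalised by a running estimate of its norm, and the neuron's weight vector is then projected back to zero mean and fixed norm. The class name NeroSketch, the hyperparameters (lr, beta, eps), and the handling of parameters are assumptions for illustration, not the paper's reference implementation.

    # Hypothetical sketch of a Nero-style per-neuron update; not the paper's code.
    import torch

    class NeroSketch(torch.optim.Optimizer):
        """Normalise each neuron's gradient by a running norm estimate, take a
        relative step, then re-project the neuron's weights to zero mean and
        its initial norm."""

        def __init__(self, params, lr=0.01, beta=0.999, eps=1e-8):
            super().__init__(params, dict(lr=lr, beta=beta, eps=eps))

        @torch.no_grad()
        def step(self, closure=None):
            loss = closure() if closure is not None else None
            for group in self.param_groups:
                lr, beta, eps = group["lr"], group["beta"], group["eps"]
                for p in group["params"]:
                    if p.grad is None or p.dim() < 2:
                        continue  # biases/gains need separate handling; skipped in this sketch
                    g = p.grad.reshape(p.shape[0], -1)   # dim 0 treated as the neuron (output) dimension
                    w = p.data.view(p.shape[0], -1)      # view shares storage with p
                    state = self.state[p]
                    if len(state) == 0:
                        state["step"] = 0
                        state["v"] = torch.zeros(p.shape[0], device=p.device)
                        state["scale"] = w.norm(dim=1).clamp(min=eps)  # remember initial per-neuron norms
                    state["step"] += 1
                    # Running average of squared per-neuron gradient norms (no momentum).
                    state["v"].mul_(beta).add_((1 - beta) * g.norm(dim=1) ** 2)
                    v_hat = state["v"] / (1 - beta ** state["step"])
                    # Relative step: each neuron moves by a fraction of its own scale.
                    w.sub_(lr * state["scale"].unsqueeze(1) * g / (v_hat.sqrt().unsqueeze(1) + eps))
                    # Project back onto the constraint set: zero mean, fixed norm per neuron.
                    w.sub_(w.mean(dim=1, keepdim=True))
                    w.mul_((state["scale"] / w.norm(dim=1).clamp(min=eps)).unsqueeze(1))
            return loss

Usage would mirror any torch.optim optimiser: construct with opt = NeroSketch(model.parameters(), lr=0.01) and run the usual loss.backward(); opt.step(); opt.zero_grad() loop. Because the step is relative to each neuron's fixed scale and the weights are re-normalised every iteration, the sketch needs neither momentum nor weight decay, which is the behaviour the abstract attributes to Nero.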

