Corpus ID: 220347631

Descending through a Crowded Valley - Benchmarking Deep Learning Optimizers

@inproceedings{Schmidt2021DescendingTA,
  title={Descending through a Crowded Valley - Benchmarking Deep Learning Optimizers},
  author={Robin M. Schmidt and Frank Schneider and Philipp Hennig},
  booktitle={ICML},
  year={2021}
}
Choosing the optimizer is among the most crucial decisions of deep learning engineers, and it is not an easy one. The growing literature now lists literally hundreds of optimization methods. In the absence of clear theoretical guidance and conclusive empirical evidence, the decision is often made according to personal anecdotes. In this work, we aim to replace these anecdotes, if not with evidence, then at least with heuristics. To do so, we perform an extensive, standardized benchmark of more…
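Concretely, the kind of standardized, tuning-aware comparison the abstract describes can be sketched as below; this is a hypothetical illustration, not the paper's code, and the problems, budget, and evaluate function are assumptions.

import random
import statistics

def benchmark(optimizers, problems, evaluate, budget=20, seeds=(0, 1, 2)):
    """Hypothetical protocol: random-search each optimizer's hyperparameters
    per problem, then re-run the best setting over several seeds.
    `evaluate(problem, optimizer, hyperparams, seed)` -> test accuracy is
    assumed to exist (models, data, and training loop not shown)."""
    results = {}
    for opt_name, search_space in optimizers.items():
        for problem in problems:
            trials = []
            for _ in range(budget):                       # random search over hyperparameters
                hp = {k: sample() for k, sample in search_space.items()}
                trials.append((evaluate(problem, opt_name, hp, seed=0), hp))
            _, best_hp = max(trials, key=lambda t: t[0])  # keep the best tuning run
            scores = [evaluate(problem, opt_name, best_hp, seed=s) for s in seeds]
            results[(opt_name, problem)] = (statistics.mean(scores), best_hp)
    return results

# Illustrative search spaces; log-uniform learning-rate sampling is a common choice.
optimizers = {
    "SGD":  {"lr": lambda: 10 ** random.uniform(-3, 0)},
    "Adam": {"lr": lambda: 10 ** random.uniform(-5, -2)},
}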
A Large Batch Optimizer Reality Check: Traditional, Generic Optimizers Suffice Across Batch Sizes
TLDR
It is demonstrated that standard optimization algorithms such as Nesterov momentum and Adam can match or exceed the results of LARS and LAMB at large batch sizes and shed light on the difficulties of comparing optimizers for neural network training more generally.
A fast point solver for deep nonlinear function approximators
Deep kernel processes (DKPs) generalise Bayesian neural networks, but do not require us to represent either features or weights. Instead, at each hidden layer they represent and optimize a flexible…
Accelerating Federated Learning with a Global Biased Optimiser
Federated Learning (FL) is a recent development in the field of machine learning that collaboratively trains models without the training data leaving client devices, in order to preserve…
Adapting Stepsizes by Momentumized Gradients Improves Optimization and Generalization
TLDR
AdaMomentum is proposed as a new optimizer that reaches the goal of training faster while generalizing better; a theory is developed to back up the improvement in optimization and generalization, and convergence guarantees are provided under both convex and nonconvex settings.
Adaptivity without Compromise: A Momentumized, Adaptive, Dual Averaged Gradient Method for Stochastic Optimization
TLDR
MADGRAD shows excellent performance on deep learning optimization problems from multiple fields, including classification and image-to-image tasks in vision, and recurrent and bidirectionally-masked models in natural language processing.
Deep learning in electron microscopy
TLDR
This review paper offers a practical perspective aimed at developers with limited familiarity with deep learning in electron microscopy; it discusses the hardware and software needed to get started with deep learning and to interface with electron microscopes.
Explainability-aided Domain Generalization for Image Classification
TLDR
This thesis, entitled Explainability-aided Domain Generalization for Image Classification, explains how domain generalization can be applied to image classification problems.
FCM-RDpA: TSK Fuzzy Regression Model Construction Using Fuzzy C-Means Clustering, Regularization, DropRule, and Powerball AdaBelief
TLDR
FCM-RDpA is proposed, which improves MBGD-RDA by replacing the grid-partition approach in rule initialization with fuzzy c-means clustering, and AdaBound with Powerball AdaBelief, which integrates the recently proposed Powerball gradient and AdaBelief to further expedite and stabilize parameter optimization.
Inverse-Dirichlet Weighting Enables Reliable Training of Physics Informed Neural Networks
TLDR
This paper presents a meta-modelling architecture suitable for Scalable Data Analytics and Artificial Intelligence (ScaDS.AI) and aims to demonstrate the architecture's applicability in a rapidly changing environment.
Meta-Learning Bidirectional Update Rules
TLDR
This paper introduces a new type of generalized neural network where neurons and synapses maintain multiple states and shows that such genomes can be meta-learned from scratch, using either conventional optimization techniques, or evolutionary strategies, such as CMA-ES.

References

Showing 1-10 of 185 references
Second-order step-size tuning of SGD for non-convex optimization
TLDR
A new stochastic first-order method (Step-Tuned SGD) is obtained, which can be seen as a stochastic version of the classical Barzilai-Borwein method and yields better results than SGD, RMSprop, or Adam.
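For context, the deterministic Barzilai-Borwein rule referenced above chooses the step size from successive iterates and gradients; the stochastic, step-tuned variant in the cited paper adapts this idea to mini-batch gradients, so its exact formulation may differ. In LaTeX notation:

\[
s_{k-1} = x_k - x_{k-1}, \qquad y_{k-1} = \nabla f(x_k) - \nabla f(x_{k-1}),
\]
\[
\alpha_k^{\mathrm{BB1}} = \frac{s_{k-1}^{\top} s_{k-1}}{s_{k-1}^{\top} y_{k-1}},
\qquad
\alpha_k^{\mathrm{BB2}} = \frac{s_{k-1}^{\top} y_{k-1}}{y_{k-1}^{\top} y_{k-1}}.
\]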
Optimizer Benchmarking Needs to Account for Hyperparameter Tuning
TLDR
Evaluating a variety of optimizers on an extensive set of standard datasets and architectures, the results indicate that Adam is the most practical solution, particularly in low-budget scenarios.
On Empirical Comparisons of Optimizers for Deep Learning
TLDR
In experiments, it is found that inclusion relationships between optimizers matter in practice and always predict optimizer comparisons, and that the popular adaptive gradient methods never underperform momentum or gradient descent.
Eve: A Gradient Based Optimization Method with Locally and Globally Adaptive Learning Rates
Adaptive gradient methods for stochastic optimization adjust the learning rate for each parameter locally. However, there is also a global learning rate which must be tuned in order to get the best…
AdaBelief Optimizer: Adapting Stepsizes by the Belief in Observed Gradients
TLDR
AdaBelief is proposed to simultaneously achieve three goals: fast convergence as in adaptive methods, good generalization as in SGD, and training stability; it outperforms other methods with fast convergence and high accuracy on image classification and language modeling.
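A minimal NumPy sketch of the AdaBelief-style update described above: it tracks an exponential moving average of the squared deviation between the gradient and its EMA (the "belief"), instead of the squared gradient itself as Adam does. Function name and defaults are illustrative; details such as weight decay and the exact epsilon placement in the official implementation may differ.

import numpy as np

def adabelief_step(theta, grad, m, s, t, lr=1e-3,
                   beta1=0.9, beta2=0.999, eps=1e-8):
    """One AdaBelief-style update on NumPy arrays (illustrative sketch)."""
    m = beta1 * m + (1 - beta1) * grad                   # EMA of gradients
    s = beta2 * s + (1 - beta2) * (grad - m) ** 2 + eps  # EMA of squared deviation from the EMA
    m_hat = m / (1 - beta1 ** t)                         # bias correction, as in Adam
    s_hat = s / (1 - beta2 ** t)
    theta = theta - lr * m_hat / (np.sqrt(s_hat) + eps)  # small deviation ("high belief")
    return theta, m, s                                   # -> larger effective step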
Lookahead Optimizer: k steps forward, 1 step back
TLDR
Lookahead improves the learning stability and lowers the variance of its inner optimizer with negligible computation and memory cost, and can significantly improve the performance of SGD and Adam, even with their default hyperparameter settings.
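A minimal sketch of the Lookahead scheme the TLDR describes: fast weights are updated k times by any inner optimizer, then the slow weights are moved a fraction alpha toward them. The inner_step and get_grad callables are assumed placeholders, not the authors' implementation.

import numpy as np

def lookahead_train(phi, inner_step, get_grad, k=5, alpha=0.5, outer_steps=100):
    """Lookahead wrapper sketch. `inner_step(theta, grad)` is any base
    optimizer update rule and `get_grad(theta)` returns a (stochastic)
    gradient; both are placeholders here."""
    for _ in range(outer_steps):
        theta = phi.copy()                  # fast weights start from the slow weights
        for _ in range(k):                  # k inner-optimizer steps "forward"
            theta = inner_step(theta, get_grad(theta))
        phi = phi + alpha * (theta - phi)   # one interpolation step "back"
    return phi

# Toy usage: plain SGD as the inner optimizer on f(x) = ||x||^2.
phi = lookahead_train(np.ones(10),
                      inner_step=lambda th, g: th - 0.1 * g,
                      get_grad=lambda th: 2 * th)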
On the Convergence of Adam and Beyond
TLDR
It is shown that one cause for such failures is the exponential moving average used in the algorithms, and suggested that the convergence issues can be fixed by endowing such algorithms with "long-term memory" of past gradients.
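The "long-term memory" fix (AMSGrad) can be sketched as follows: the second-moment denominator is replaced by its running maximum, so effective step sizes cannot grow when the EMA shrinks. Details such as bias correction vary across variants; this is an illustrative sketch, not the paper's exact algorithm.

import numpy as np

def amsgrad_step(theta, grad, m, v, v_max, t, lr=1e-3,
                 beta1=0.9, beta2=0.999, eps=1e-8):
    """Adam-like step whose denominator uses the running maximum of the
    second-moment EMA (the 'long-term memory' of past gradients)."""
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad ** 2
    v_max = np.maximum(v_max, v)                         # never let the denominator shrink
    m_hat = m / (1 - beta1 ** t)                         # bias correction (variant-dependent)
    theta = theta - lr * m_hat / (np.sqrt(v_max) + eps)
    return theta, m, v, v_max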
DeepOBS: A Deep Learning Optimizer Benchmark Suite
TLDR
DeepOBS is presented, a Python package of deep learning optimization benchmarks that addresses key challenges in the quantitative assessment of stochastic optimizers, and automates most steps of benchmarking.
A Generalizable Approach to Learning Optimizers
TLDR
This work describes a system designed from a generalization-first perspective that learns to update optimizer hyperparameters instead of model parameters directly, using novel features, actions, and a reward function; it outperforms Adam on all neural network tasks, including modalities not seen during training.