Corpus ID: 231918801

A Large Batch Optimizer Reality Check: Traditional, Generic Optimizers Suffice Across Batch Sizes

Zachary Nado, Justin Gilmer, Christopher J. Shallue, Rohan Anil, George E. Dahl
Recently the LARS and LAMB optimizers have been proposed for training neural networks faster using large batch sizes. LARS and LAMB add layer-wise normalization to the update rules of Heavy-ball momentum and Adam, respectively, and have become popular in prominent benchmarks and deep learning libraries. However, without fair comparisons to standard optimizers, it remains an open question whether LARS and LAMB have any benefit over traditional, generic algorithms. In this work we demonstrate… 

Uncertainty Baselines: Benchmarks for Uncertainty & Robustness in Deep Learning

Zachary Nado, Neil Band, Mark Collier, Josip Djolonga, Michael W. Dusenberry, Sebastian…

Adan: Adaptive Nesterov Momentum Algorithm for Faster Optimizing Deep Models

The ADAptive Nesterov momentum algorithm (Adan for short) is proposed to effectively speed up the training of deep neural networks; it surpasses the corresponding SoTA optimizers on both CNNs and transformers and sets new SoTAs for many popular networks and frameworks.

A high-resolution dynamical view on momentum methods for over-parameterized neural networks

Convergence analysis of HB and NAG in training an over-parameterized two-layer neural network with ReLU activation is provided, through the lens of high-resolution dynamical systems and neural tangent kernel (NTK) theory.


Inspired by the conditioning perspective, it is shown that learning rate warmup can improve training stability just as much as batch normalization, layer normalization, MetaInit, GradInit, and Fixup initialization.
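As a rough illustration of learning rate warmup, the sketch below ramps the learning rate linearly from zero to a base value over a fixed number of steps and then holds it constant. The linear shape and the names `warmup_lr`, `base_lr`, and `warmup_steps` are assumptions made for illustration; the summarized paper may use a different schedule.

```python
def warmup_lr(step, base_lr, warmup_steps):
    """Linear warmup (illustrative): ramp from 0 to base_lr, then hold."""
    if warmup_steps <= 0:
        # No warmup requested: use the base learning rate from the start.
        return base_lr
    # Fraction of warmup completed, capped at 1.0 once warmup is over.
    return base_lr * min(1.0, step / warmup_steps)


if __name__ == "__main__":
    for step in (0, 50, 100, 200):
        print(step, warmup_lr(step, base_lr=0.1, warmup_steps=100))
```

In practice a schedule like this is usually followed by a decay phase, which is omitted here to keep the sketch minimal.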

ScaLA: Accelerating Adaptation of Pre-Trained Transformer-Based Language Models via Efficient Large-Batch Adversarial Noise

This paper proposes ScaLA, a novel and efficient method that accelerates the adaptation of pre-trained transformer networks. It takes a sequential game-theoretic approach, adding lightweight adversarial noise to large-batch optimization, which significantly improves adaptation speed while preserving model generalization.

Predicting the utility of search spaces for black-box optimization: a simple, budget-aware approach

The goal of this work is to motivate the problem of predicting the quality of search spaces conditioned on budgets, as well as to provide a simple scoring method based on a utility function applied to a probabilistic response surface model, similar to Bayesian optimization.

Towards a robust out-of-the-box neural network model for genomic data

This work investigates the robustness, generalization potential and prediction accuracy of widely used convolutional neural network and natural language processing models with a variety of heterogeneous genomic datasets and identifies certain model characteristics that translate well across datasets and could serve as a baseline model for translational researchers.

On Large-Cohort Training for Federated Learning

This work explores how the number of clients sampled at each round (the cohort size) impacts the quality of the learned model and the training dynamics of federated learning algorithms.

Are We Learning Yet? A Meta Review of Evaluation Failures Across Machine Learning

Thomas Liao. Computer Science. NeurIPS Datasets and Benchmarks, 2021.
A meta-review of 107 survey papers from computer vision, natural language processing, recommender systems, reinforcement learning, graph processing, metric learning, and more is conducted, organizing a wide range of surprisingly consistent critique into a concrete taxonomy of observed failure modes.

Large-Scale Deep Learning Optimizations: A Comprehensive Survey

This survey aims to provide a clear sketch of optimizations for large-scale deep learning with regard to model accuracy and model efficiency, and investigates the algorithms most commonly used for optimization.



On Empirical Comparisons of Optimizers for Deep Learning

In experiments, it is found that inclusion relationships between optimizers matter in practice and always predict optimizer comparisons, and that the popular adaptive gradient methods never underperform momentum or gradient descent.

Descending through a Crowded Valley - Benchmarking Deep Learning Optimizers

An extensive, standardized benchmark of more than a dozen particularly popular deep learning optimizers is performed, identifying a significantly reduced subset of specific algorithms and parameter choices that generally provided competitive results in the authors' experiments.

Large Batch Training of Convolutional Networks

It is argued that the current recipe for large batch training (linear learning rate scaling with warm-up) is not general enough and that training may diverge; a new training algorithm based on Layer-wise Adaptive Rate Scaling (LARS) is proposed.
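The layer-wise scaling idea can be sketched for a single layer as follows. This is a simplified illustration of a commonly cited form of the LARS update, in which a per-layer "local" learning rate is the ratio of the weight norm to the gradient norm; it is not the authors' reference implementation. The function names, default hyperparameters, and the `eps` guard are assumptions.

```python
import math


def l2_norm(xs):
    """Euclidean norm of a flat list of floats."""
    return math.sqrt(sum(x * x for x in xs))


def lars_step(weights, grads, velocity, lr, trust_coef=0.001,
              momentum=0.9, weight_decay=0.0, eps=1e-12):
    """One LARS-style update for one layer (illustrative sketch)."""
    w_norm = l2_norm(weights)
    g_norm = l2_norm(grads)
    if w_norm > 0.0 and g_norm > 0.0:
        # Layer-wise rate: step size is set relative to the weight norm,
        # so the raw gradient scale of the layer largely cancels out.
        local_lr = trust_coef * w_norm / (g_norm + weight_decay * w_norm + eps)
    else:
        # Assumed fallback when either norm is zero.
        local_lr = 1.0
    new_velocity = [
        momentum * v + lr * local_lr * (g + weight_decay * w)
        for w, g, v in zip(weights, grads, velocity)
    ]
    new_weights = [w - v for w, v in zip(weights, new_velocity)]
    return new_weights, new_velocity


if __name__ == "__main__":
    w, v = lars_step([3.0, 4.0], [0.6, 0.8], [0.0, 0.0], lr=1.0)
    print(w, v)
```

With `weight_decay=0`, multiplying the gradient by a constant leaves the update essentially unchanged, which is the property that distinguishes this rule from plain momentum SGD.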

Disentangling Adaptive Gradient Methods from Learning Rates

A "grafting" experiment is introduced which decouples an update's magnitude from its direction, finding that many existing beliefs in the literature may have arisen from insufficient isolation of the implicit schedule of step sizes.
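A minimal sketch of the grafting idea, assuming each optimizer has already produced its own proposed step as a flat list of floats: the combined step takes its norm from one optimizer and its direction from the other. The function name `graft` and the zero-direction fallback are illustrative assumptions, and the paper's experiments may apply this per layer rather than to a single vector.

```python
import math


def graft(magnitude_step, direction_step, eps=1e-12):
    """Combine the norm of one step with the direction of another (sketch)."""
    m_norm = math.sqrt(sum(x * x for x in magnitude_step))
    d_norm = math.sqrt(sum(x * x for x in direction_step))
    if d_norm < eps:
        # Assumed fallback: no usable direction, so take no step.
        return [0.0 for _ in direction_step]
    scale = m_norm / d_norm
    return [scale * d for d in direction_step]


if __name__ == "__main__":
    # Step with norm 5 supplies the magnitude; [0, 2] supplies the direction.
    print(graft([3.0, 4.0], [0.0, 2.0]))
```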

Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift

Applied to a state-of-the-art image classification model, Batch Normalization achieves the same accuracy with 14 times fewer training steps, and beats the original model by a significant margin.

Understanding Short-Horizon Bias in Stochastic Meta-Optimization

Short-horizon bias is a fundamental problem that needs to be addressed if meta-optimization is to scale to practical neural net training regimes; it is analyzed on a toy problem, a noisy quadratic cost function.

On the importance of initialization and momentum in deep learning

It is shown that when stochastic gradient descent with momentum uses a well-designed random initialization and a particular type of slowly increasing schedule for the momentum parameter, it can train both DNNs and RNNs to levels of performance that were previously achievable only with Hessian-Free optimization.

Train longer, generalize better: closing the generalization gap in large batch training of neural networks

This work proposes a "random walk on random landscape" statistical model, which is known to exhibit similar "ultra-slow" diffusion behavior, and presents a novel algorithm named "Ghost Batch Normalization" that enables a significant decrease in the generalization gap without increasing the number of updates.
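The core idea of Ghost Batch Normalization, computing normalization statistics over small virtual batches inside a large batch, can be sketched as below. This is a deliberately reduced illustration on a single scalar feature: it omits the learned scale and shift and the running statistics used at inference time, and the names `ghost_batch_norm` and `ghost_size` are assumptions.

```python
import math


def ghost_batch_norm(values, ghost_size, eps=1e-5):
    """Normalize each consecutive chunk with its own mean and variance."""
    out = []
    for start in range(0, len(values), ghost_size):
        chunk = values[start:start + ghost_size]
        mean = sum(chunk) / len(chunk)
        # Biased (population) variance, as in standard batch normalization.
        var = sum((x - mean) ** 2 for x in chunk) / len(chunk)
        out.extend((x - mean) / math.sqrt(var + eps) for x in chunk)
    return out


if __name__ == "__main__":
    # A "large batch" of 4 values split into two ghost batches of 2.
    print(ghost_batch_norm([1.0, 3.0, 10.0, 14.0], ghost_size=2))
```

Because each chunk is normalized independently, the two chunks above map to roughly the same normalized values even though their raw scales differ.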

Adam: A Method for Stochastic Optimization

This work introduces Adam, an algorithm for first-order gradient-based optimization of stochastic objective functions, based on adaptive estimates of lower-order moments, and provides a regret bound on the convergence rate that is comparable to the best known results under the online convex optimization framework.
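For reference, the widely used form of the Adam update for a single scalar parameter can be sketched as follows. The moment estimates are exponential moving averages of the gradient and its square, with bias correction for the zero initialization. The function name `adam_step` and the argument layout are assumptions; the default hyperparameters are the commonly quoted ones.

```python
import math


def adam_step(w, g, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a scalar parameter; t is the 1-based step count."""
    # Moving averages of the gradient (first moment) and its square (second).
    m = beta1 * m + (1.0 - beta1) * g
    v = beta2 * v + (1.0 - beta2) * g * g
    # Bias correction compensates for initializing m and v at zero.
    m_hat = m / (1.0 - beta1 ** t)
    v_hat = v / (1.0 - beta2 ** t)
    w = w - lr * m_hat / (math.sqrt(v_hat) + eps)
    return w, m, v


if __name__ == "__main__":
    w, m, v = 1.0, 0.0, 0.0
    for t in range(1, 4):
        w, m, v = adam_step(w, g=0.5, m=m, v=v, t=t)
        print(t, w)
```

On the first step the bias-corrected ratio is close to the sign of the gradient, so the parameter moves by roughly `lr` regardless of the gradient's scale.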

Rethinking the Inception Architecture for Computer Vision

This work explores ways to scale up networks that use the added computation as efficiently as possible, through suitably factorized convolutions and aggressive regularization.