• Corpus ID: 220041771

Accelerated Large Batch Optimization of BERT Pretraining in 54 minutes

Shuai Zheng, Haibin Lin, Sheng Zha, Mu Li
BERT has recently attracted a lot of attention in natural language understanding (NLU) and achieved state-of-the-art results in various NLU tasks. However, its success requires large deep neural networks and huge amounts of data, which result in long training times and impede development progress. Using stochastic gradient methods with large mini-batches has been advocated as an efficient tool to reduce the training time. Along this line of research, LAMB is a prominent example that reduces the…


Normalization Techniques in Training DNNs: Methodology, Analysis and Application

A unified picture of the main motivation behind different approaches from the perspective of optimization is provided, and a taxonomy for understanding the similarities and differences between them is presented.

Context, Language Modeling, and Multimodal Data in Finance

The improvement in classification accuracy is material, suggesting that full text and context are important in classifying financial documents and that the benefits from the use of mixed data are feasible and fruitful in machine learning models in finance.

DynaMaR: Dynamic Prompt with Mask Token Representation

This paper proposes an improvement to prompt-based fine-tuning that addresses two issues that arise with the standard prompt approach, and shows that DynaMaR can achieve an average improvement of 10% in few-shot settings and of 3.7% in data-rich settings over the standard fine-tuning approach on four e-commerce applications.

Asynchronous Convergence in Multi-Task Learning via Knowledge Distillation from Converged Tasks

This work proposes a novel approach that avoids the problem of requiring all tasks to converge at the same rate, but rather allows for “asynchronous” convergence among the tasks where each task can converge on its own schedule.

MiCS: Near-linear Scaling for Training Gigantic Model on Public Cloud

MiCS, which minimizes the communication scale to bring down communication overhead, is proposed. It can utilize heterogeneous network bandwidth, reduce network traffic over slower links, reduce communication latency to maintain high network bandwidth utilization, and amortize expensive global gradient synchronization overhead.

Optimizing Data Layout for Training Deep Neural Networks

A simple yet effective data layout arbitration framework, built upon a formulated cache estimation model, is proposed; it automatically picks the beneficial data layout for different DNNs under different pruning schemes.

Language models for the prediction of SARS-CoV-2 inhibitors

This work pre-trained a deep learning language model (BERT) on ∼9.6 billion molecules and achieved peak performance of 603 petaflops in mixed precision, reducing pre-training time from days to hours, compared to previous efforts with this architecture, while also increasing the dataset size by nearly an order of magnitude.

Parallelizing DNN Training on GPUs: Challenges and Opportunities

The main challenges in adopting data parallelism and model parallelism on multi-GPU platforms are identified and a survey including recent research works targeting these challenges is conducted.

Compressed Communication for Distributed Training: Adaptive Methods and System

This paper introduces a novel adaptive gradient method with gradient compression that has a convergence rate of O(1/√T) for non-convex problems and develops a scalable system called BytePS-Compress for two-way compression, where the gradients are compressed in both directions between workers and parameter servers.

Accelerating Transformer-based Deep Learning Models on FPGAs using Column Balanced Block Pruning

This paper investigates the column balanced block-wise pruning on Transformer and designs an FPGA acceleration engine to customize the balanced blockwise matrix multiplication.



Reducing BERT Pre-Training Time from 3 Days to 76 Minutes

The LAMB optimizer is proposed, which helps to scale the batch size to 65536 without losing accuracy, and is a general optimizer that works for both small and large batch sizes and does not need hyper-parameter tuning besides the learning rate.

Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour

This paper empirically shows that on the ImageNet dataset large minibatches cause optimization difficulties, but that when these are addressed the trained networks exhibit good generalization, enabling visual recognition models to be trained on internet-scale data with high efficiency.
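One widely cited way of addressing these difficulties is the recipe a later entry in this list calls "linear learning rate scaling with warm-up". The sketch below illustrates that idea only; the function names, the reference batch size of 256, and the linear ramp are assumptions made for illustration, not code from the paper.

```python
def scaled_lr(base_lr, batch_size, base_batch_size=256):
    """Linear scaling rule: grow the learning rate in proportion to the batch size.

    base_batch_size=256 is an assumed reference point for this sketch.
    """
    return base_lr * batch_size / base_batch_size


def warmup_lr(step, warmup_steps, base_lr, target_lr):
    """Ramp linearly from base_lr to target_lr over the first warmup_steps steps."""
    if step >= warmup_steps:
        return target_lr
    return base_lr + (target_lr - base_lr) * step / warmup_steps


if __name__ == "__main__":
    target = scaled_lr(base_lr=0.1, batch_size=8192)
    for step in (0, 50, 100, 200):
        print(step, warmup_lr(step, warmup_steps=100, base_lr=0.1, target_lr=target))
```

Starting from the small-batch rate and ramping up avoids taking very large steps while the weights are still changing rapidly early in training.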

Optimization Methods for Large-Scale Machine Learning

A major theme of this study is that large-scale machine learning represents a distinctive setting in which the stochastic gradient method has traditionally played a central role while conventional gradient-based nonlinear optimization techniques typically falter, leading to a discussion about the next generation of optimization methods for large-scale machine learning.

Large Batch Training of Convolutional Networks

It is argued that the current recipe for large batch training (linear learning rate scaling with warm-up) is not general enough and that training may diverge, and a new training algorithm based on Layer-wise Adaptive Rate Scaling (LARS) is proposed.
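As a rough illustration of layer-wise adaptive rate scaling, the sketch below computes a per-layer learning rate from the ratio of the weight norm to the gradient norm and applies one plain SGD step. It is a simplified sketch, not the paper's reference implementation: momentum is omitted, and the function names, the default trust coefficient, and the zero-norm fallback are assumptions.

```python
import math


def l2_norm(values):
    """Euclidean norm of a flat list of floats."""
    return math.sqrt(sum(v * v for v in values))


def lars_local_lr(weights, grads, trust_coef=0.001, weight_decay=0.0):
    """Per-layer rate: trust_coef * ||w|| / (||g|| + weight_decay * ||w||)."""
    w_norm = l2_norm(weights)
    denom = l2_norm(grads) + weight_decay * w_norm
    if w_norm == 0.0 or denom == 0.0:
        return 1.0  # assumed fallback: leave the global rate unscaled
    return trust_coef * w_norm / denom


def lars_sgd_step(weights, grads, global_lr, trust_coef=0.001, weight_decay=0.0):
    """One SGD step for a single layer, scaled by its local rate (no momentum)."""
    local_lr = lars_local_lr(weights, grads, trust_coef, weight_decay)
    return [
        w - global_lr * local_lr * (g + weight_decay * w)
        for w, g in zip(weights, grads)
    ]


if __name__ == "__main__":
    layer_w = [3.0, 4.0]
    layer_g = [0.6, 0.8]
    print(lars_local_lr(layer_w, layer_g))
    print(lars_sgd_step(layer_w, layer_g, global_lr=1.0))
```

Because the gradient is divided by its own norm, the size of the step for a layer depends on the weight norm rather than on the raw gradient magnitude. With weight decay set to zero, multiplying the gradient by a constant leaves the update unchanged.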

On the Convergence of Adam and Beyond

It is shown that one cause for such failures is the exponential moving average used in the algorithms, and suggested that the convergence issues can be fixed by endowing such algorithms with "long-term memory" of past gradients.
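A common way to realize this "long-term memory" is the AMSGrad variant, which keeps the running maximum of the second-moment estimate so that the effective step size cannot grow again after a large gradient has been seen. The scalar sketch below is illustrative only; the function name, the state dictionary layout, the default hyper-parameters, and the `eps` term added for numerical safety are assumptions.

```python
import math


def amsgrad_step(param, grad, state, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One AMSGrad-style update for a single scalar parameter.

    state holds "m" (first moment), "v" (second moment) and "v_hat"
    (the running maximum of v). eps is an assumed safety term.
    """
    state["m"] = beta1 * state["m"] + (1 - beta1) * grad
    state["v"] = beta2 * state["v"] + (1 - beta2) * grad * grad
    # Long-term memory: never let the denominator shrink below its past maximum.
    state["v_hat"] = max(state["v_hat"], state["v"])
    return param - lr * state["m"] / (math.sqrt(state["v_hat"]) + eps)


if __name__ == "__main__":
    state = {"m": 0.0, "v": 0.0, "v_hat": 0.0}
    x = 1.0
    for g in (10.0, 0.1, 0.1):
        x = amsgrad_step(x, g, state)
        print(x, state["v"], state["v_hat"])
```

In the usage loop, the one large gradient fixes `v_hat`, and the later small gradients lower `v` without lowering `v_hat`.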

On the importance of initialization and momentum in deep learning

It is shown that when stochastic gradient descent with momentum uses a well-designed random initialization and a particular type of slowly increasing schedule for the momentum parameter, it can train both DNNs and RNNs to levels of performance that were previously achievable only with Hessian-Free optimization.


Normalized gradient methods that use a constant step size with occasional decay, such as SGD with momentum, perform better in deep convolutional neural networks, while those with adaptive step sizes perform better in recurrent neural networks.

Efficient mini-batch training for stochastic optimization

It is proved that the convergence rate does not decrease with increasing minibatch size, and with suitable implementations of approximate optimization, the resulting algorithm can outperform standard SGD in many scenarios.

Normalized Gradient with Adaptive Stepsize Method for Deep Neural Network Training

The results indicate the normalized gradient with adaptive step size can help accelerate the training of neural networks, and significant speedup can be observed if the networks are deep or the dependencies are long.

Large Scale Distributed Deep Networks

This paper considers the problem of training a deep network with billions of parameters using tens of thousands of CPU cores and develops two algorithms for large-scale distributed training, Downpour SGD and Sandblaster L-BFGS, which increase the scale and speed of deep network training.