Corpus ID: 220041771

Accelerated Large Batch Optimization of BERT Pretraining in 54 minutes

  • Shuai Zheng, Haibin Lin, Sheng Zha, M. Li
  • Published 2020
  • Computer Science, Mathematics
  • ArXiv
  • BERT has recently attracted a lot of attention in natural language understanding (NLU) and achieved state-of-the-art results in various NLU tasks. However, its success requires large deep neural networks and huge amounts of data, which result in long training times and impede development progress. Using stochastic gradient methods with large mini-batches has been advocated as an efficient tool to reduce the training time. Along this line of research, LAMB is a prominent example that reduces the…
    1 Citation

    Normalization Techniques in Training DNNs: Methodology, Analysis and Application
    • 2

    Reducing BERT Pre-Training Time from 3 Days to 76 Minutes
    • 79
    • Highly Influential
    Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour
    • 1,459
    Optimization Methods for Large-Scale Machine Learning
    • 1,209
    • Highly Influential
    Large Batch Training of Convolutional Networks
    • 178
    On the Convergence of Adam and Beyond
    • 1,017
    On the importance of initialization and momentum in deep learning
    • 2,753
    • Highly Influential
    Efficient mini-batch training for stochastic optimization
    • 421
    Normalized Gradient with Adaptive Stepsize Method for Deep Neural Network Training
    • 14
    Large Scale Distributed Deep Networks
    • 2,541