Corpus ID: 233204537

Efficient Large-Scale Language Model Training on GPU Clusters

D. Narayanan, M. Shoeybi, J. Casper, P. LeGresley, M. Patwary, V. Korthikanti, Dmitri Vainbrand, Prethvi Kashinkunti, J. Bernauer, B. Catanzaro, Amar Phanishayee, M. Zaharia
Large language models have led to state-of-the-art accuracies across a range of tasks. However, training these large models efficiently is challenging for two reasons: a) GPU memory capacity is limited, making it impossible to fit large models on a single GPU or even on a multi-GPU server; and b) the number of compute operations required to train these models can result in unrealistically long training times. New methods of model parallelism such as tensor and pipeline parallelism have been…
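The tensor (intra-layer) model parallelism the abstract refers to can be illustrated with a minimal sketch: a linear layer's weight matrix is split column-wise across workers, each worker computes a partial output, and concatenating the partials reproduces the full layer output. This is only a single-process toy with NumPy standing in for GPUs; the worker split and sizes are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

# Toy sketch of tensor model parallelism: split a linear layer's
# weight matrix column-wise across two simulated workers.
rng = np.random.default_rng(0)
x = rng.standard_normal((4, 8))   # batch of 4 activations, hidden size 8
W = rng.standard_normal((8, 6))   # full weight matrix (8 -> 6)

# Column-parallel split: worker 0 holds columns 0..2, worker 1 holds 3..5.
W0, W1 = W[:, :3], W[:, 3:]
y0 = x @ W0                       # partial output on "GPU 0"
y1 = x @ W1                       # partial output on "GPU 1"

# Gathering the partial outputs matches the unsplit computation.
y_parallel = np.concatenate([y0, y1], axis=1)
y_full = x @ W
assert np.allclose(y_parallel, y_full)
```

In a real multi-GPU setting the concatenation step corresponds to a collective communication (an all-gather) across the tensor-parallel group, which is where the communication cost of this scheme comes from.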