Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism
@article{Shoeybi2019MegatronLMTM,
  title   = {Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism},
  author  = {M. Shoeybi and M. Patwary and R. Puri and P. LeGresley and J. Casper and Bryan Catanzaro},
  journal = {ArXiv},
  year    = {2019},
  volume  = {abs/1909.08053}
}
Recent work in language modeling demonstrates that training large transformer models advances the state of the art in Natural Language Processing applications. However, very large models can be quite difficult to train due to memory constraints. In this work, we present our techniques for training very large transformer models and implement a simple, efficient intra-layer model parallel approach that enables training transformer models with billions of parameters. Our approach does not require…
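To make the abstract's key idea concrete, here is a minimal sketch of intra-layer (tensor) model parallelism applied to a transformer MLP block, written in PyTorch with torch.distributed. The names (`ParallelMLP`, `_CopyToParallelRegion`, `_ReduceFromParallelRegion`) and the single-tensor-parallel-group setup are illustrative assumptions, not the Megatron-LM repository's actual API; the paper's `f` and `g` communication operators are approximated as small autograd functions.

```python
# Sketch only: one process per GPU, all ranks in a single tensor-parallel group.
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.distributed as dist


class _CopyToParallelRegion(torch.autograd.Function):
    """The paper's `f` operator: identity in the forward pass, all-reduce in the backward pass."""

    @staticmethod
    def forward(ctx, x):
        return x

    @staticmethod
    def backward(ctx, grad):
        if dist.is_initialized() and dist.get_world_size() > 1:
            dist.all_reduce(grad)
        return grad


class _ReduceFromParallelRegion(torch.autograd.Function):
    """The paper's `g` operator: all-reduce in the forward pass, identity in the backward pass."""

    @staticmethod
    def forward(ctx, x):
        if dist.is_initialized() and dist.get_world_size() > 1:
            dist.all_reduce(x)
        return x

    @staticmethod
    def backward(ctx, grad):
        return grad


class ParallelMLP(nn.Module):
    """Transformer MLP block whose two GEMMs are sharded across tensor-parallel ranks.

    The first weight matrix is partitioned column-wise so GeLU is applied locally
    with no communication; the second is partitioned row-wise, and a single
    all-reduce per forward pass (plus one per backward pass) recombines the
    partial outputs.
    """

    def __init__(self, hidden_size: int, ffn_size: int, world_size: int):
        super().__init__()
        assert ffn_size % world_size == 0, "FFN width must divide evenly across ranks"
        shard = ffn_size // world_size
        # Each rank stores only its shard of the two weight matrices.
        self.w1 = nn.Parameter(torch.randn(hidden_size, shard) * 0.02)  # column shard
        self.w2 = nn.Parameter(torch.randn(shard, hidden_size) * 0.02)  # row shard

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = _CopyToParallelRegion.apply(x)                   # f: identity fwd, all-reduce bwd
        h = F.gelu(x @ self.w1)                              # local shard, no communication
        return _ReduceFromParallelRegion.apply(h @ self.w2)  # g: all-reduce fwd, identity bwd
```

Without distributed initialization this reduces to an ordinary MLP, so the sketch can be run single-process; in a multi-GPU setting each rank would call `dist.init_process_group("nccl")` and be launched with a standard launcher such as torchrun (the surrounding training script is hypothetical here). Self-attention is parallelized analogously in the paper by partitioning attention heads across ranks.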
Supplemental Code
GitHub repo (via Papers with Code): ongoing research training transformer language models at scale, including BERT and GPT-2.
158 Citations
- GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding. ArXiv, 2020. 28 citations.
- Memory-Efficient Pipeline-Parallel DNN Training. ArXiv, 2020. 1 citation. Highly Influenced.
- Scaling Distributed Deep Learning Workloads beyond the Memory Capacity with KARMA. ArXiv, 2020. Highly Influenced.
- Accelerating Training of Transformer-Based Language Models with Progressive Layer Dropping. NeurIPS, 2020. Highly Influenced.
- Efficient Transformer-based Large Scale Language Representations using Hardware-friendly Block Structured Pruning. EMNLP, 2020. Highly Influenced.
- DDLBench: Towards a Scalable Benchmarking Infrastructure for Distributed Deep Learning. IEEE/ACM Fourth Workshop on Deep Learning on Supercomputers (DLS), 2020.
- ALBERT: A Lite BERT for Self-supervised Learning of Language Representations. ICLR, 2020. 898 citations.
References
Showing 1-10 of 51 references
- GPipe: Efficient Training of Giant Neural Networks using Pipeline Parallelism. NeurIPS, 2019. 305 citations. Highly Influential.
- ALBERT: A Lite BERT for Self-supervised Learning of Language Representations. ICLR, 2020. 898 citations.
- Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context. ACL, 2019. 813 citations. Highly Influential.
- Generating Long Sequences with Sparse Transformers. ArXiv, 2019. 225 citations. Highly Influential.
- BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. NAACL-HLT, 2019. 13,726 citations. Highly Influential.
- Empirical Evaluation and Combination of Advanced Language Modeling Techniques. INTERSPEECH, 2011. 287 citations.
- Efficient and Robust Parallel DNN Training through Model Parallelism on Multi-GPU Platform. ArXiv, 2018. 24 citations.