Corpus ID: 202660670

Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism

@article{Shoeybi2019MegatronLMTM,
  title={Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism},
  author={M. Shoeybi and M. Patwary and R. Puri and P. LeGresley and J. Casper and Bryan Catanzaro},
  journal={ArXiv},
  year={2019},
  volume={abs/1909.08053}
}
Recent work in language modeling demonstrates that training large transformer models advances the state of the art in Natural Language Processing applications. However, very large models can be quite difficult to train due to memory constraints. In this work, we present our techniques for training very large transformer models and implement a simple, efficient intra-layer model parallel approach that enables training transformer models with billions of parameters. Our approach does not require…
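
The intra-layer (tensor) model parallel approach summarized above splits the weight matrices of each transformer layer across GPUs and keeps the ranks synchronized with a few communication collectives (the paper implements these in native PyTorch). The snippet below is a minimal sketch of that idea for the MLP block only, assuming an already initialized torch.distributed process group; the names ParallelMLP, world_size, and _AllReduceForward are illustrative and not taken from the Megatron-LM code base.

# Minimal sketch (not the authors' implementation) of intra-layer model
# parallelism for a transformer MLP block: the first linear layer is split
# column-wise across GPUs, the second row-wise, and a single all-reduce per
# forward pass restores the full output. Assumes torch.distributed is
# already initialized; all class and parameter names are illustrative.
import torch
import torch.nn as nn
import torch.distributed as dist


class _AllReduceForward(torch.autograd.Function):
    # All-reduce in the forward pass, identity in the backward pass.
    # (The complementary operator -- identity forward, all-reduce backward --
    # would be applied to the block input; omitted here for brevity.)
    @staticmethod
    def forward(ctx, x):
        x = x.clone()
        dist.all_reduce(x)  # sum the partial results from all model-parallel ranks
        return x

    @staticmethod
    def backward(ctx, grad_output):
        return grad_output


class ParallelMLP(nn.Module):
    def __init__(self, hidden_size: int, ffn_size: int, world_size: int):
        super().__init__()
        assert ffn_size % world_size == 0
        shard = ffn_size // world_size
        # Column-parallel: each rank holds a slice of the FFN output features,
        # so the GeLU nonlinearity can be applied locally without communication.
        self.fc1 = nn.Linear(hidden_size, shard)
        # Row-parallel: each rank holds the matching slice of the second GEMM's
        # input features and produces a partial sum of the full output.
        # (Bias omitted: a per-rank bias would be summed world_size times.)
        self.fc2 = nn.Linear(shard, hidden_size, bias=False)
        self.act = nn.GELU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        y = self.act(self.fc1(x))          # local shard of the hidden activations
        z = self.fc2(y)                    # this rank's partial sum of the output
        return _AllReduceForward.apply(z)  # one all-reduce per MLP block

Each model-parallel rank would construct ParallelMLP(hidden_size, ffn_size, dist.get_world_size()) inside the initialized process group; the paper partitions the self-attention block analogously by splitting it across attention heads.
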
158 Citations (selection)
• GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding
• Memory-Efficient Pipeline-Parallel DNN Training
• Scaling Distributed Deep Learning Workloads beyond the Memory Capacity with KARMA
• Accelerating Training of Transformer-Based Language Models with Progressive Layer Dropping
• Efficient Transformer-based Large Scale Language Representations using Hardware-friendly Block Structured Pruning
• DDLBench: Towards a Scalable Benchmarking Infrastructure for Distributed Deep Learning
• Transformers: State-of-the-Art Natural Language Processing
• Emergent Properties of Finetuned Language Representation Models
• XtremeDistil: Multi-stage Distillation for Massive Multilingual Models
