TeraPipe: Token-Level Pipeline Parallelism for Training Large-Scale Language Models
@article{Li2021TeraPipeTP,
  title   = {TeraPipe: Token-Level Pipeline Parallelism for Training Large-Scale Language Models},
  author  = {Zhuohan Li and Siyuan Zhuang and Shiyuan Guo and Danyang Zhuo and Hao Zhang and Dawn Xiaodong Song and Ion Stoica},
  journal = {ArXiv},
  year    = {2021},
  volume  = {abs/2102.07988}
}
Model parallelism has become a necessity for training modern large-scale deep language models. In this work, we identify a new and orthogonal dimension from existing model parallel approaches: it is possible to perform pipeline parallelism within a single training sequence for Transformer-based language models thanks to their autoregressive property. This enables a more fine-grained pipeline compared with previous work. With this key idea, we design TeraPipe, a high-performance token-level…
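The abstract's key observation is that a single training sequence can itself be sliced along the token dimension and pipelined across stages. Below is a minimal sketch of why this is legal for autoregressive Transformers (illustrative only, not the TeraPipe implementation; the slice sizes, dimensions, and wavefront printout are made up for this example):

```python
# Sketch: the output for token slice t depends only on slices 1..t, so a
# pipeline stage can send slice t downstream before it has processed the
# rest of the sequence.

import torch

def causal_attention_slice(q, k_cache, v_cache, k_new, v_new):
    """Attend the new query slice over cached keys/values plus its own slice;
    causality inside the slice is enforced with a triangular mask."""
    k = torch.cat([k_cache, k_new], dim=1) if k_cache is not None else k_new
    v = torch.cat([v_cache, v_new], dim=1) if v_cache is not None else v_new
    scores = q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5       # (B, Lq, Lk)
    past = k.shape[1] - q.shape[1]
    mask = torch.ones(q.shape[1], k.shape[1], dtype=torch.bool)
    mask[:, past:] = torch.tril(torch.ones(q.shape[1], q.shape[1], dtype=torch.bool))
    scores = scores.masked_fill(~mask, float("-inf"))
    return torch.softmax(scores, dim=-1) @ v, k, v

# One stage consuming a sequence slice-by-slice; each output slice could be
# handed to the next pipeline stage immediately.
B, d, slice_len, num_slices = 2, 16, 3, 8
k_cache = v_cache = None
for t in range(num_slices):
    x = torch.randn(B, slice_len, d)                  # slice t of the sequence
    out, k_cache, v_cache = causal_attention_slice(x, k_cache, v_cache, x, x)

# Wavefront schedule: with S stages and T slices, stage s works on slice
# (step - s) at clock `step`, so many slices of ONE sequence are in flight.
num_stages = 4
for step in range(num_stages + num_slices - 1):
    active = [(s, step - s) for s in range(num_stages) if 0 <= step - s < num_slices]
    print(f"clock {step}: (stage, slice) pairs in flight: {active}")
```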
21 Citations
Efficient large-scale language model training on GPU clusters using megatron-LM
- Computer ScienceSC
- 2021
This paper proposes a novel interleaved pipelining schedule that can improve throughput by 10+% with memory footprint comparable to existing approaches and allows us to perform training iterations on a model with 1 trillion parameters at 502 petaFLOP/s on 3072 GPUs.
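A toy illustration of what an interleaved assignment gives each device (not Megatron-LM's actual scheduler; the layer and device counts are made up): each device owns several non-contiguous chunks of layers instead of one contiguous block, which shrinks the pipeline bubble at the cost of more communication.

```python
num_layers, num_devices, chunks_per_device = 16, 4, 2
layers_per_chunk = num_layers // (num_devices * chunks_per_device)

assignment = {d: [] for d in range(num_devices)}
for chunk_id in range(num_devices * chunks_per_device):
    device = chunk_id % num_devices                   # round-robin over devices
    start = chunk_id * layers_per_chunk
    assignment[device].append(list(range(start, start + layers_per_chunk)))

for device, chunks in assignment.items():
    print(f"device {device}: layer chunks {chunks}")
# device 0 ends up with layers [0, 1] and [8, 9], device 1 with [2, 3] and
# [10, 11], and so on, rather than one contiguous block of 4 layers.
```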
Efficient Large-Scale Language Model Training on GPU Clusters
- Computer ScienceArXiv
- 2021
This work shows how to compose different types of parallelism methods (tensor, pipeline, and data parallelism) to scale to thousands of GPUs, achieving a two-order-of-magnitude increase in the sizes of models the authors can efficiently train compared to existing systems.
Amazon SageMaker Model Parallelism: A General and Flexible Framework for Large Model Training
- Computer ScienceArXiv
- 2021
Amazon SageMaker model parallelism is presented, a software library that integrates with PyTorch and enables easy training of large models using model parallelism and other memory-saving features; its performance is evaluated on GPT-3, RoBERTa, BERT, and neural collaborative filtering.
Alpa: Automating Inter- and Intra-Operator Parallelism for Distributed Deep Learning
- Computer ScienceArXiv
- 2022
Alpa automates model-parallel training of large deep learning models by generating execution plans that unify data, operator, and pipeline parallelism and generalizes to models with heterogeneous architectures and models without manually-designed plans.
PatrickStar: Parallel Training of Pre-trained Models via a Chunk-based Memory Management
- Computer ScienceArXiv
- 2021
PatrickStar reduces the memory requirements of computing platforms by using the CPU-GPU heterogeneous memory space to store model data, consisting of parameters, gradients, and optimizer states, and manages model data in chunks, which are dynamically distributed across heterogeneous memory spaces.
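A simplified sketch of the chunk idea (not PatrickStar's API; the class and method names here are invented for illustration): model tensors are packed into fixed-size flat chunks, and whole chunks are moved between CPU and GPU memory on demand with a single bulk transfer.

```python
from math import prod
import torch

class Chunk:
    """A fixed-size flat buffer holding several tensors; it can be moved
    wholesale between CPU and GPU in one transfer."""
    def __init__(self, numel, dtype=torch.float32):
        self.storage = torch.zeros(numel, dtype=dtype, device="cpu")
        self.layout = {}                       # name -> (offset, shape)
        self.used = 0

    def add(self, name, shape):
        n = prod(shape)
        assert self.used + n <= self.storage.numel(), "chunk is full"
        self.layout[name] = (self.used, shape)
        self.used += n

    def get(self, name):
        off, shape = self.layout[name]
        return self.storage[off:off + prod(shape)].view(shape)

    def to(self, device):
        self.storage = self.storage.to(device)    # one bulk transfer
        return self

# Park chunks on CPU; fetch a chunk to the GPU right before its layers run.
chunk = Chunk(numel=1 << 20)
chunk.add("layer0.weight", (512, 512))
chunk.add("layer0.bias", (512,))
if torch.cuda.is_available():
    chunk.to("cuda")
weight = chunk.get("layer0.weight")                # view into the chunk
```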
Pipeline Parallelism for Inference on Heterogeneous Edge Computing
- Computer ScienceArXiv
- 2021
This work proposes EdgePipe, a distributed framework for edge systems that uses pipeline parallelism to both speed up inference and enable running larger (and more accurate) models that otherwise cannot fit on single edge devices.
Decentralized Training of Foundation Models in Heterogeneous Environments
- Computer ScienceArXiv
- 2022
This paper presents the first study of training large foundation models with model parallelism in a decentralized regime over a heterogeneous network, and provides a formal cost model and an efficient evolutionary algorithm to find the optimal allocation strategy.
Reducing Activation Recomputation in Large Transformer Models
- Computer ScienceArXiv
- 2022
This work presents two novel yet very simple techniques, sequence parallelism and selective activation recomputation, which in conjunction with tensor parallelism almost eliminate the need to recompute activations.
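A generic sketch of selective recomputation using PyTorch's checkpoint utility (an illustration of the general idea, not the paper's exact scheme; the layer shapes are made up): only the attention block, which is memory-hungry but cheap to recompute, is checkpointed, while the rest of the layer keeps its activations as usual.

```python
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint

class SelectiveLayer(nn.Module):
    def __init__(self, d_model=256, n_heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                                 nn.Linear(4 * d_model, d_model))
        self.norm1, self.norm2 = nn.LayerNorm(d_model), nn.LayerNorm(d_model)

    def _attn_block(self, x):
        out, _ = self.attn(x, x, x, need_weights=False)
        return out

    def forward(self, x):
        # Only the attention block is recomputed during backward.
        x = x + checkpoint(self._attn_block, self.norm1(x), use_reentrant=False)
        x = x + self.mlp(self.norm2(x))        # these activations are kept
        return x

layer = SelectiveLayer()
y = layer(torch.randn(2, 128, 256, requires_grad=True))
y.sum().backward()
```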
BAGUA: Scaling up Distributed Learning with System Relaxations
- Computer ScienceProc. VLDB Endow.
- 2021
BAGUA is built, a communication framework whose design goal is to provide a system abstraction that is both flexible and modular to support state-of-the-art system relaxation techniques of distributed training.
BAGUA: Scaling up Distributed Learning with System Relaxations
- Computer Science
BAGUA is built, an MPI-style communication library providing a collection of primitives that is both flexible and modular to support state-of-the-art system relaxation techniques for distributed training; a rigorous tradeoff exploration shows that different algorithms and system relaxations achieve the best performance under different network conditions.
References
Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism
- Computer ScienceArXiv
- 2019
A simple, efficient intra-layer model-parallel approach is presented that enables training transformer models with billions of parameters; the authors also show that careful attention to the placement of layer normalization in BERT-like models is critical to achieving increased performance as the model size grows.
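A single-process sketch of the intra-layer (tensor) model-parallel split, purely illustrative (Megatron-LM actually shards across GPUs and uses collective communication): the weight of one linear layer is split column-wise between two workers, each computes a partial output, and the partial outputs are concatenated.

```python
import torch

torch.manual_seed(0)
x = torch.randn(4, 1024)                  # (batch, hidden)
w = torch.randn(1024, 4096)               # full weight of a feed-forward layer

w0, w1 = w.chunk(2, dim=1)                # column split: one half per worker
y0, y1 = x @ w0, x @ w1                   # local matmuls on each worker
y_parallel = torch.cat([y0, y1], dim=1)   # all-gather in a real setup

assert torch.allclose(y_parallel, x @ w, atol=1e-4)
```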
GPipe: Efficient Training of Giant Neural Networks using Pipeline Parallelism
- Computer ScienceNeurIPS
- 2019
GPipe is introduced, a pipeline parallelism library that allows scaling any network that can be expressed as a sequence of layers by pipelining different sub-sequences of layers on separate accelerators, resulting in almost linear speedup when a model is partitioned across multiple accelerators.
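An illustrative sketch of GPipe-style micro-batching (not the GPipe library itself; the stage sizes and toy loss are made up): the mini-batch is cut into micro-batches so several can be in flight across the pipeline stages at once, and gradients are accumulated before one synchronous update.

```python
import torch
import torch.nn as nn

stages = nn.ModuleList([nn.Sequential(nn.Linear(32, 32), nn.ReLU())
                        for _ in range(4)])              # 4 pipeline "stages"
opt = torch.optim.SGD([p for s in stages for p in s.parameters()], lr=0.1)

batch = torch.randn(64, 32)
micro_batches = batch.chunk(8)                           # 8 micro-batches

opt.zero_grad()
for mb in micro_batches:              # in a real pipeline these run
    h = mb                            # concurrently on different devices
    for stage in stages:
        h = stage(h)
    loss = h.pow(2).mean() / len(micro_batches)          # scale for accumulation
    loss.backward()                                      # gradients accumulate
opt.step()                                               # one synchronous update
```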
Beyond Data and Model Parallelism for Deep Neural Networks
- Computer ScienceMLSys
- 2019
This work defines SOAP, a more comprehensive search space of parallelization strategies for DNNs that includes strategies to parallelize a DNN in the Sample, Operation, Attribute, and Parameter dimensions, and proposes FlexFlow, a deep learning framework that uses guided randomized search of the SOAP space to find a fast parallelization strategy for a specific parallel machine.
DAPPLE: a pipelined data parallel approach for training large models
- Computer SciencePPoPP
- 2021
DAPPLE, a synchronous training framework that combines data parallelism and pipeline parallelism for large DNN models, is proposed; it features a novel parallelization strategy planner to solve the partition and placement problems and explores optimal hybrid strategies of data and pipeline parallelism.
ZeRO: Memory Optimization Towards Training A Trillion Parameter Models
- Computer ScienceSC
- 2020
This work develops a novel solution, Zero Redundancy Optimizer (ZeRO), to optimize memory, achieving both memory efficiency and scaling efficiency, and demonstrates that ZeRO has the potential to scale beyond 1 trillion parameters using today's hardware.
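A single-process sketch of the ZeRO stage-1 idea (illustrative only, not the DeepSpeed implementation; the sizes are made up): each data-parallel worker keeps optimizer state (e.g. Adam moments) only for its own shard of the parameters, cutting per-worker optimizer memory by the number of workers.

```python
import torch

world_size = 4
params = torch.randn(1_000_000)                  # flattened model parameters
shards = params.chunk(world_size)                # one shard per worker

# Each worker allocates Adam moments only for its shard.
optimizer_state = [
    {"exp_avg": torch.zeros_like(s), "exp_avg_sq": torch.zeros_like(s)}
    for s in shards
]

full_bytes = 2 * params.numel() * 4              # moments for every parameter
sharded_bytes = 2 * shards[0].numel() * 4        # moments for one shard only
print(f"per-worker optimizer memory: {sharded_bytes / full_bytes:.0%} of unsharded")
```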
PipeDream: Fast and Efficient Pipeline Parallel DNN Training
- Computer ScienceArXiv
- 2018
Experiments with five different DNNs on two different clusters show that PipeDream is up to 5x faster in time-to-accuracy compared to data-parallel training.
Exploring Hidden Dimensions in Parallelizing Convolutional Neural Networks
- Computer ScienceICML
- 2018
The experiments show that layer-wise parallelism outperforms current parallelization approaches by increasing training speed, reducing communication costs, achieving better scalability to multiple GPUs, while maintaining the same network accuracy.
Mesh-TensorFlow: Deep Learning for Supercomputers
- Computer ScienceNeurIPS
- 2018
Mesh-TensorFlow is introduced, a language for specifying a general class of distributed tensor computations, and used to implement an efficient data-parallel, model-parallel version of the Transformer sequence-to-sequence model, surpassing state-of-the-art results on the WMT'14 English-to-French translation task and the one-billion-word language modeling benchmark.
Big Bird: Transformers for Longer Sequences
- Computer ScienceNeurIPS
- 2020
It is shown that BigBird is a universal approximator of sequence functions and is Turing complete, thereby preserving these properties of the quadratic, full attention model.
Training Deep Nets with Sublinear Memory Cost
- Computer ScienceArXiv
- 2016
This work designs an algorithm that costs O(√n) memory to train an n-layer network, with only the computational cost of an extra forward pass per mini-batch, and shows that it is possible to trade computation for memory, giving a more memory-efficient training algorithm with a little extra computation cost.
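A minimal sketch of the O(√n) trade-off using PyTorch's checkpointing helper (the paper's algorithm is framework-independent; the toy network below is made up): the n layers are split into roughly √n segments, only segment-boundary activations are kept, and activations inside a segment are recomputed during the backward pass.

```python
import math
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint_sequential

n_layers = 64
net = nn.Sequential(*[nn.Sequential(nn.Linear(128, 128), nn.ReLU())
                      for _ in range(n_layers)])

x = torch.randn(16, 128, requires_grad=True)
segments = int(math.sqrt(n_layers))                  # ~sqrt(n) checkpoints
out = checkpoint_sequential(net, segments, x, use_reentrant=False)
out.sum().backward()                                 # inner activations recomputed
```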