Corpus ID: 231934213

TeraPipe: Token-Level Pipeline Parallelism for Training Large-Scale Language Models

@article{Li2021TeraPipeTP,
  title={TeraPipe: Token-Level Pipeline Parallelism for Training Large-Scale Language Models},
  author={Zhuohan Li and Siyuan Zhuang and Shiyuan Guo and Danyang Zhuo and Hao Zhang and Dawn Xiaodong Song and Ion Stoica},
  journal={ArXiv},
  year={2021},
  volume={abs/2102.07988}
}
Model parallelism has become a necessity for training modern large-scale deep language models. In this work, we identify a new dimension, orthogonal to existing model-parallel approaches: thanks to the autoregressive property of Transformer-based language models, it is possible to perform pipeline parallelism within a single training sequence. This enables a finer-grained pipeline than previous work. With this key idea, we design TeraPipe, a high-performance token-level…
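The idea hinges on causal attention: the hidden states for a token position at any layer depend only on that position and earlier positions, so a single sequence can be cut into token slices and those slices can flow through the pipeline stages (groups of layers on different devices) much like micro-batches. The following is a minimal scheduling sketch, not the paper's implementation; the stage count, slice count, and unit cost per slice are illustrative assumptions, and only the forward pass is simulated.

```python
# Minimal sketch of token-level pipelining within one sequence (illustrative
# only). Each pipeline stage holds a group of Transformer layers; the sequence
# is split into NUM_SLICES token slices. Because attention is causal, stage s
# can process slice i as soon as stage s-1 has produced slice i (the slices
# 0..i-1 it also needs were already processed by stage s earlier and can be
# kept as a key/value cache). Assume every slice costs one "tick" per stage.

NUM_STAGES = 4   # pipeline stages (layer groups on different devices)
NUM_SLICES = 8   # token slices cut from a single training sequence

def forward_schedule(num_stages: int, num_slices: int):
    """Map each tick to the (stage, slice) pairs that run concurrently."""
    schedule = {}
    for stage in range(num_stages):
        for sl in range(num_slices):
            tick = stage + sl  # earliest tick at which this pair can run
            schedule.setdefault(tick, []).append((stage, sl))
    return schedule

if __name__ == "__main__":
    for tick, work in sorted(forward_schedule(NUM_STAGES, NUM_SLICES).items()):
        print(f"t={tick:2d}: " + ", ".join(f"stage{s}<-slice{i}" for s, i in work))
    # Pipelined forward: NUM_STAGES + NUM_SLICES - 1 ticks, versus
    # NUM_STAGES * NUM_SLICES if the whole sequence moved stage by stage.
    print("pipelined ticks:  ", NUM_STAGES + NUM_SLICES - 1)
    print("unpipelined ticks:", NUM_STAGES * NUM_SLICES)
```

The full paper also pipelines the backward pass and uses a dynamic-programming procedure to pick non-uniform slice sizes, since causal attention makes later tokens more expensive than earlier ones; neither detail is modeled in this sketch.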

Figures and Tables from this paper

Citations

Efficient large-scale language model training on GPU clusters using Megatron-LM
TLDR
This paper proposes a novel interleaved pipelining schedule that improves throughput by more than 10% with a memory footprint comparable to existing approaches, and allows training iterations on a model with 1 trillion parameters at 502 petaFLOP/s on 3072 GPUs (a back-of-the-envelope bubble-fraction sketch follows this list).
Efficient Large-Scale Language Model Training on GPU Clusters
TLDR
This work shows how to compose different types of parallelism methods (tensor, pipeline, and data parallelism) to scale to thousands of GPUs, achieving a two-order-of-magnitude increase in the sizes of models the authors can efficiently train compared to existing systems.
Amazon SageMaker Model Parallelism: A General and Flexible Framework for Large Model Training
TLDR
Amazon SageMaker model parallelism is presented, a software library that integrates with PyTorch and enables easy training of large models using model parallelism and other memory-saving features; performance is evaluated on GPT-3, RoBERTa, BERT, and neural collaborative filtering.
Alpa: Automating Inter- and Intra-Operator Parallelism for Distributed Deep Learning
TLDR
Alpa automates model-parallel training of large deep learning models by generating execution plans that unify data, operator, and pipeline parallelism, and generalizes to models with heterogeneous architectures and to models without manually designed plans.
PatrickStar: Parallel Training of Pre-trained Models via a Chunk-based Memory Management
TLDR
PatrickStar reduces the memory requirements of computing platforms by using the CPU-GPU heterogeneous memory space to store model data (parameters, gradients, and optimizer states) and manages model data in chunks, which are dynamically distributed across the heterogeneous memory spaces.
Pipeline Parallelism for Inference on Heterogeneous Edge Computing
TLDR
This work proposes EdgePipe, a distributed framework for edge systems that uses pipeline parallelism to both speed up inference and enable running larger (and more accurate) models that otherwise cannot fit on single edge devices.
Decentralized Training of Foundation Models in Heterogeneous Environments
TLDR
This paper presents the first study of training large foundation models with model parallelism in a decentralized regime over a heterogeneous network, and provides a formal cost model and an efficient evolutionary algorithm to find the optimal allocation strategy.
Reducing Activation Recomputation in Large Transformer Models
TLDR
This work presents two novel yet very simple techniques: sequence parallelism and selective activation recomputation, which almost eliminate the need to recompute activations in conjunction with tensor parallelism.
BAGUA: Scaling up Distributed Learning with System Relaxations
TLDR
BAGUA is built, a communication framework whose design goal is to provide a system abstraction that is both flexible and modular to support state-of-the-art system relaxation techniques of distributed training.
BAGUA: Scaling up Distributed Learning with System Relaxations
TLDR
BAGUA is built, an MPI-style communication library providing a collection of primitives that is both flexible and modular to support state-of-the-art system relaxation techniques of distributed training; a rigorous tradeoff exploration shows that different algorithms and system relaxations achieve the best performance under different network conditions.
...
...
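A back-of-the-envelope helper for the interleaved pipelining schedule in the first citation above (Megatron-LM training on GPU clusters): with p pipeline stages, m microbatches per batch, and v model chunks ("virtual stages") interleaved on each device, that paper's analysis puts the pipeline "bubble" (idle) fraction at roughly (p-1)/(v*m). This is a sketch of that arithmetic only; the concrete values of p, m, and v below are made up, and the helper is not code from any of the systems cited here.

```python
# Illustrative arithmetic only: approximate idle ("bubble") fraction of a
# 1F1B pipeline schedule, following the analysis in the Megatron-LM paper
# cited above. p = pipeline stages, m = microbatches per batch, v = model
# chunks interleaved per device (v = 1 means no interleaving).

def bubble_fraction(p: int, m: int, v: int = 1) -> float:
    """Fraction of time the pipeline sits idle; smaller is better."""
    return (p - 1) / (v * m)

if __name__ == "__main__":
    p, m = 8, 32  # made-up example values
    print(f"non-interleaved (v=1): {bubble_fraction(p, m):.3f}")       # ~0.219
    print(f"interleaved     (v=4): {bubble_fraction(p, m, v=4):.3f}")  # ~0.055
```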

References

SHOWING 1-10 OF 37 REFERENCES
Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism
TLDR
A simple, efficient intra-layer model-parallel approach is presented that enables training Transformer models with billions of parameters, and careful attention to the placement of layer normalization in BERT-like models is shown to be critical to achieving increased performance as the model size grows.
GPipe: Efficient Training of Giant Neural Networks using Pipeline Parallelism
TLDR
GPipe is introduced, a pipeline parallelism library that allows scaling any network that can be expressed as a sequence of layers by pipelining different sub-sequences of layers on separate accelerators, resulting in almost linear speedup when a model is partitioned across multiple accelerators.
Beyond Data and Model Parallelism for Deep Neural Networks
TLDR
SOAP, a more comprehensive search space of parallelization strategies for DNNs that includes strategies to parallelize a DNN in the Sample, Operation, Attribute, and Parameter dimensions, is defined, and FlexFlow, a deep learning framework that uses guided randomized search of the SOAP space to find a fast parallelization strategy for a specific parallel machine, is proposed.
DAPPLE: a pipelined data parallel approach for training large models
TLDR
DAPPLE, a synchronous training framework that combines data parallelism and pipeline parallelism for large DNN models, is proposed; it features a novel parallelization strategy planner to solve the partition and placement problems and explores optimal hybrid strategies of data and pipeline parallelism.
ZeRO: Memory Optimization Towards Training A Trillion Parameter Models
TLDR
This work develops a novel solution, Zero Redundancy Optimizer (ZeRO), to optimize memory, achieving both memory efficiency and scaling efficiency, and demonstrates that ZeRO has the potential to scale beyond 1 trillion parameters using today's hardware.
PipeDream: Fast and Efficient Pipeline Parallel DNN Training
TLDR
Experiments with five different DNNs on two different clusters show that PipeDream is up to 5x faster in time-to-accuracy compared to data-parallel training.
Exploring Hidden Dimensions in Parallelizing Convolutional Neural Networks
TLDR
The experiments show that layer-wise parallelism outperforms current parallelization approaches by increasing training speed, reducing communication costs, and achieving better scalability to multiple GPUs, while maintaining the same network accuracy.
Mesh-TensorFlow: Deep Learning for Supercomputers
TLDR
Mesh-TensorFlow is introduced, a language for specifying a general class of distributed tensor computations, and is used to implement an efficient data-parallel, model-parallel version of the Transformer sequence-to-sequence model, surpassing state-of-the-art results on the WMT'14 English-to-French translation task and the one-billion-word language modeling benchmark.
Big Bird: Transformers for Longer Sequences
TLDR
It is shown that BigBird is a universal approximator of sequence functions and is Turing complete, thereby preserving these properties of the quadratic, full attention model.
Training Deep Nets with Sublinear Memory Cost
TLDR
This work designs an algorithm that costs O(√n) memory to train an n-layer network, with only the computational cost of an extra forward pass per mini-batch, and shows that it is possible to trade computation for memory, giving a more memory-efficient training algorithm with a little extra computation cost (see the checkpointing sketch after this reference list).
...
...
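The "Training Deep Nets with Sublinear Memory Cost" reference above (see the note there) describes the recompute-instead-of-store trade behind activation checkpointing. Below is a hedged sketch of that trade using PyTorch's generic torch.utils.checkpoint.checkpoint_sequential utility rather than the paper's own code; the depth, width, batch size, and choice of roughly √n segments are illustrative.

```python
# Sketch of activation (gradient) checkpointing: store activations only at
# ~sqrt(n) segment boundaries and recompute the segment interiors during the
# backward pass, trading one extra forward pass for O(sqrt(n)) activation
# memory. Uses PyTorch's torch.utils.checkpoint utilities; sizes are arbitrary.
import math

import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint_sequential

n_layers = 16
net = nn.Sequential(
    *[nn.Sequential(nn.Linear(256, 256), nn.ReLU()) for _ in range(n_layers)]
)

segments = max(1, math.isqrt(n_layers))  # ~sqrt(n) checkpointed segments

x = torch.randn(8, 256, requires_grad=True)
out = checkpoint_sequential(net, segments, x)  # forward, dropping inner activations
out.sum().backward()                           # inner activations recomputed here
print(f"{segments} checkpointed segments over {n_layers} layers")
```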