• Corpus ID: 246411250

Alpa: Automating Inter- and Intra-Operator Parallelism for Distributed Deep Learning

  title={Alpa: Automating Inter- and Intra-Operator Parallelism for Distributed Deep Learning},
  author={Lianmin Zheng and Zhuohan Li and Hao Zhang and Yonghao Zhuang and Zhifeng Chen and Yanping Huang and Yida Wang and Yuanzhong Xu and Danyang Zhuo and Joseph Gonzalez and Ion Stoica},
Alpa automates model-parallel training of large deep learning (DL) models by generating execution plans that unify data, operator, and pipeline parallelism. Existing model-parallel training systems either require users to manually create a parallelization plan or automatically generate one from a limited space of model parallelism configurations, which does not suffice to scale out complex DL models on distributed compute devices. Alpa distributes the training of large DL models by viewing… 
Unity: Accelerating DNN Training Through Joint Optimization of Algebraic Transformations and Parallelization
This paper presents Unity, the first system that jointly optimizes algebraic transformations and parallelization in distributed DNN training. Unity represents both parallelization and algebraic
Decentralized Training of Foundation Models in Heterogeneous Environments
This paper presents the first study of training large foundation models with model parallelism in a decentralized regime over a heterogeneous network, and provides a formal cost model and an efficient evolutionary algorithm to find the optimal allocation strategy.


Language Models are Few-Shot Learners
GPT-3 achieves strong performance on many NLP datasets, including translation, question-answering, and cloze tasks, as well as several tasks that require on-the-fly reasoning or domain adaptation, such as unscrambling words, using a novel word in a sentence, or performing 3-digit arithmetic.
Wide Residual Networks
This paper conducts a detailed experimental study on the architecture of ResNet blocks and proposes a novel architecture where the depth and width of residual networks are decreased and the resulting network structures are called wide residual networks (WRNs), which are far superior over their commonly used thin and very deep counterparts.
GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding
GShard enabled us to scale up multilingual neural machine translation Transformer model with Sparsely-Gated Mixture-of-Experts beyond 600 billion parameters using automatic sharding and it is demonstrated that such a giant model can efficiently be trained on 2048 TPU v3 accelerators in 4 days to achieve far superior quality for translation from 100 languages to English compared to the prior art.
JAX: composable transformations of Python+NumPy programs, 2018
  • 2018
TeraPipe: Token-Level Pipeline Parallelism for Training Large-Scale Language Models
This work identifies a new and orthogonal dimension from existing model parallel approaches: it is possible to perform pipeline parallelism within a single training sequence for Transformer-based language models thanks to its autoregressive property, which enables a more fine-grained pipeline compared with previous work.
Efficient large-scale language model training on GPU clusters using megatron-LM
This paper proposes a novel interleaved pipelining schedule that can improve throughput by 10+% with memory footprint comparable to existing approaches and allows us to perform training iterations on a model with 1 trillion parameters at 502 petaFLOP/s on 3072 GPUs.
DAPPLE: a pipelined data parallel approach for training large models
DAPPLE, a synchronous training framework which combines data parallelism and pipeline parallelism for large DNN models, is proposed, which features a novel parallelization strategy planner to solve the partition and placement problems, and explores the optimal hybrid strategies of data and pipeline Parallelism.
ZeRO: Memory optimizations Toward Training Trillion Parameter Models
ZeRO eliminates memory redundancies in data- and model-parallel training while retaining low communication volume and high computational granularity, allowing us to scale the model size proportional to the number of devices with sustained high efficiency.
Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism
A simple, efficient intra-layer model parallel approach that enables training transformer models with billions of parameters and shows that careful attention to the placement of layer normalization in BERT-like models is critical to achieving increased performance as the model size grows.
PipeDream: generalized pipeline parallelism for DNN training
PipeDream is presented, a system that adds inter-batch pipelining to intra-batch parallelism to further improve parallel training throughput, helping to better overlap computation with communication and reduce the amount of communication when possible.