• Corpus ID: 231839901

PipeTransformer: Automated Elastic Pipelining for Distributed Training of Transformers

Chaoyang He, Shen Li, Mahdi Soltanolkotabi, Salman Avestimehr
The size of Transformer models is growing at an unprecedented rate. It has taken less than one year to reach trillion-level parameters since the release of GPT-3 (175B). Training such models requires both substantial engineering efforts and enormous computing resources, which are luxuries most research teams cannot afford. In this paper, we propose PipeTransformer, which leverages automated elastic pipelining for efficient distributed training of Transformer models. In PipeTransformer, we… 
BAGUA: Scaling up Distributed Learning with System Relaxations
BAGUA is a communication framework designed to provide a system abstraction that is both flexible and modular, so that it can support state-of-the-art system relaxation techniques for distributed training.
Pipeline Parallelism for Inference on Heterogeneous Edge Computing
This work proposes EdgePipe, a distributed framework for edge systems that uses pipeline parallelism to both speed up inference and enable running larger (and more accurate) models that otherwise cannot fit on single edge devices.
End-to-end Adaptive Distributed Training on PaddlePaddle
This adaptive framework is equipped with a global cost model and a global planner, which enable arbitrary parallelism, resource-aware placement, multi-mode execution, and fault-tolerant, elastic distributed training, and which can satisfy the varied requirements arising from diverse applications and heterogeneous resources.
BAGUA: Scaling up Distributed Learning with System Relaxations
BAGUA is an MPI-style communication library providing a collection of primitives that is both flexible and modular to support state-of-the-art system relaxation techniques of distributed training, and a rigorous tradeoff exploration shows that different algorithms and system relaxations achieve the best performance over different network conditions.
Efficient Large-Scale Language Model Training on GPU Clusters Using Megatron-LM
This paper proposes a novel interleaved pipelining schedule that can improve throughput by 10+% with memory footprint comparable to existing approaches and allows us to perform training iterations on a model with 1 trillion parameters at 502 petaFLOP/s on 3072 GPUs.
FLEET—Fast Lanes for Expedited Execution at 10 Terabits: Program Overview
Fleet provides a primarily off-the-shelf solution with high-end servers and shared computational and storage resources connected via PCIe over a reconfigurable MEMS optical switch that uses custom Optical NICs to allow arbitrary topologies to be configured before or even during execution.
Reservoir Transformers
Inspired by old and well-established ideas in machine learning, a variety of non-linear “reservoir” layers interspersed with regular transformer layers are explored, and improvements in wall-clock compute time until convergence are shown.
Subgraph Federated Learning with Missing Neighbor Generation
Two major techniques are proposed: FedSage, which trains a GraphSage model based on FedAvg to integrate node features, link structures, and task labels on multiple local subgraphs; and FedSage+, which trains a missing neighbor generator along with FedSage to deal with missing links across local subgraphs.


PipeDream: generalized pipeline parallelism for DNN training
PipeDream is presented, a system that adds inter-batch pipelining to intra-batch parallelism to further improve parallel training throughput, helping to better overlap computation with communication and reduce the amount of communication when possible.
GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding
GShard enabled scaling up a multilingual neural machine translation Transformer model with Sparsely-Gated Mixture-of-Experts beyond 600 billion parameters using automatic sharding, and it is demonstrated that such a giant model can be trained efficiently on 2048 TPU v3 accelerators in 4 days to achieve far superior quality for translation from 100 languages to English compared to the prior art.
HetPipe: Enabling Large DNN Training on (Whimpy) Heterogeneous GPU Clusters through Integration of Pipelined Model Parallelism and Data Parallelism
This paper investigates how to enable training of large DNN models on a heterogeneous GPU cluster that possibly includes whimpy GPUs that, as a standalone, could not be used for training, and proposes a novel parameter synchronization model, which is referred to as Wave Synchronous Parallel (WSP), to accommodate both PMP and DP for virtual workers.
EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks
A new scaling method is proposed that uniformly scales all dimensions of depth/width/resolution using a simple yet highly effective compound coefficient, and its effectiveness is demonstrated on scaling up MobileNets and ResNet.
Parallax: Sparsity-aware Data Parallel Training of Deep Neural Networks
Parallax introduces a hybrid approach that combines Parameter Server and AllReduce architectures to optimize the amount of data transfer according to the sparsity of model parameters, and achieves scalable training throughput on both dense and sparse models while requiring little effort from its users.
GPipe: Efficient Training of Giant Neural Networks using Pipeline Parallelism
GPipe is introduced, a pipeline parallelism library that allows scaling any network that can be expressed as a sequence of layers by pipelining different sub-sequences of layers on separate accelerators, resulting in almost linear speedup when a model is partitioned across multiple accelerators.
Mesh-TensorFlow: Deep Learning for Supercomputers
Mesh-TensorFlow is introduced, a language for specifying a general class of distributed tensor computations, and is used to implement an efficient data-parallel, model-parallel version of the Transformer sequence-to-sequence model, surpassing state-of-the-art results on the WMT'14 English-to-French translation task and the one-billion-word language modeling benchmark.
Attention is All you Need
A new simple network architecture, the Transformer, is proposed, based solely on attention mechanisms and dispensing with recurrence and convolutions entirely; it is shown to generalize well to other tasks by applying it successfully to English constituency parsing with both large and limited training data.
ZeRO: Memory Optimization Towards Training A Trillion Parameter Models
This work develops a novel solution, Zero Redundancy Optimizer (ZeRO), to optimize memory, achieving both memory efficiency and scaling efficiency, and demonstrates ZeRO has the potential to scale beyond 1 Trillion parameters using today's hardware.
Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism
A simple, efficient intra-layer model-parallel approach is presented that enables training transformer models with billions of parameters, and it is shown that careful attention to the placement of layer normalization in BERT-like models is critical to achieving increased performance as the model size grows.