Corpus ID: 246017095

Sequence Parallelism: Long Sequence Training from System Perspective

Shenggui Li, Fuzhao Xue, Yongbin Li, Yang You
Transformer achieves promising results on various tasks. However, self-attention suffers from quadratic memory requirements with respect to the sequence length. Existing work focuses on reducing time and space complexity from an algorithm perspective. In this work, we propose sequence parallelism, a memory-efficient parallelism method to help us break input sequence length limitation and train with longer sequences on GPUs efficiently. Our approach is compatible with most existing parallelisms… 
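
To make the quadratic memory claim concrete, here is a minimal sketch of the arithmetic for the attention score matrix alone. The function name, the default element size of 2 bytes, and the way rows are split across devices are illustrative assumptions, not details taken from the paper.

```python
def attention_score_bytes(seq_len, batch=1, heads=1, bytes_per_elem=2, num_devices=1):
    """Bytes needed to hold the attention score matrix on one device.

    Assumption (illustrative only): each device keeps seq_len // num_devices
    query rows and scores them against all seq_len keys.
    """
    if seq_len % num_devices != 0:
        raise ValueError("seq_len must be divisible by num_devices")
    rows_per_device = seq_len // num_devices
    return batch * heads * rows_per_device * seq_len * bytes_per_elem


# Doubling the sequence length quadruples the score-matrix memory.
for n in (1024, 2048, 4096):
    print(n, attention_score_bytes(n, heads=12))
```

Under these assumptions, splitting the query rows over `num_devices` devices divides the per-device score memory by the same factor, which is the kind of saving a sequence-dimension split is after.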


Reducing Activation Recomputation in Large Transformer Models
This work presents two novel yet very simple techniques: sequence parallelism and selective activation recomputation, which almost eliminate the need to recompute activations in conjunction with tensor parallelism.


Linformer: Self-Attention with Linear Complexity
This paper demonstrates that the self-attention mechanism of the Transformer can be approximated by a low-rank matrix, and proposes a new self-attention mechanism that reduces the overall self-attention complexity from $O(n^2)$ to $O(n)$ in both time and space.
GPipe: Efficient Training of Giant Neural Networks using Pipeline Parallelism
GPipe is a pipeline parallelism library that scales any network expressible as a sequence of layers by pipelining different sub-sequences of layers on separate accelerators, resulting in almost linear speedup when a model is partitioned across multiple accelerators.
DeepSpeed: System Optimizations Enable Training Deep Learning Models with Over 100 Billion Parameters
Explore new techniques in Microsoft's open source library called DeepSpeed, which advances large model training by improving scale, speed, cost, and usability, unlocking the ability to train
Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism
A simple, efficient intra-layer model parallel approach that enables training transformer models with billions of parameters and shows that careful attention to the placement of layer normalization in BERT-like models is critical to achieving increased performance as the model size grows.
GSPMD: General and Scalable Parallelization for ML Computation Graphs
GSPMD allows users to write programs in the same way as for a single device, then give hints through a few annotations on how to distribute tensors, based on which GSPMD will parallelize the computation.
GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding
GShard enabled scaling a multilingual neural machine translation Transformer model with Sparsely-Gated Mixture-of-Experts beyond 600 billion parameters using automatic sharding. Such a giant model can be trained efficiently on 2048 TPU v3 accelerators in 4 days, achieving far superior quality for translation from 100 languages to English compared to the prior art.
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
A new language representation model, BERT, designed to pre-train deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers, which can be fine-tuned with just one additional output layer to create state-of-the-art models for a wide range of tasks.
Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity
This work simplifies the MoE routing algorithm and designs intuitive, improved models with reduced communication and computational costs. It advances the current scale of language models by pre-training up to trillion-parameter models on the “Colossal Clean Crawled Corpus”, and achieves a 4x speedup over the T5-XXL model.
ZeRO-Offload: Democratizing Billion-Scale Model Training
ZeRO-Offload democratizes large-scale model training, making it accessible even to data scientists with access to just a single GPU, and combines compute and memory efficiency with ease of use.
BERT Representations for Video Question Answering
This work proposes to use BERT, a sequential modelling technique based on Transformers, to encode the complex semantics from video clips to capture the visual and language information of a video scene by encoding not only the subtitles but also a sequence of visual concepts with a pretrained language-based Transformer.