Corpus ID: 237593026

Scalable and Efficient MoE Training for Multitask Multilingual Models

@article{Kim2021ScalableAE,
  title={Scalable and Efficient MoE Training for Multitask Multilingual Models},
  author={Young Jin Kim and Ammar Ahmad Awan and Alexandre Muzio and Andr{\'e}s Felipe Cruz-Salinas and Liyang Lu and Amr Hendy and Samyam Rajbhandari and Yuxiong He and Hany Hassan Awadalla},
  journal={ArXiv},
  year={2021},
  volume={abs/2109.10465}
}
Mixture of Experts (MoE) models are an emerging class of sparsely activated deep learning models whose compute cost grows sublinearly with their parameter count. In contrast to dense models, the sparse architecture of MoE offers opportunities to drastically grow model size with significant accuracy gains while consuming a much lower compute budget. However, supporting large-scale MoE training also brings its own set of system and modeling challenges. To overcome the challenges and… 
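To make the sparsity pattern concrete, below is a minimal, illustrative sketch of a top-1 gated MoE layer in PyTorch. This is not the paper's implementation; the layer sizes, expert count, and the plain Python loop over experts are assumptions chosen for readability. Each token is processed by only one expert, which is why compute per token stays roughly constant while total parameters grow with the number of experts.

```python
# Illustrative top-1 gated mixture-of-experts feed-forward layer (not the paper's code).
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopOneMoE(nn.Module):
    def __init__(self, d_model: int, d_ff: int, num_experts: int):
        super().__init__()
        self.gate = nn.Linear(d_model, num_experts)          # router
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
            for _ in range(num_experts)
        )

    def forward(self, x):                                    # x: (num_tokens, d_model)
        probs = F.softmax(self.gate(x), dim=-1)              # routing probabilities
        top_p, top_idx = probs.max(dim=-1)                   # top-1 expert per token
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            mask = top_idx == e                              # tokens routed to expert e
            if mask.any():
                # each token runs through a single expert, so per-token compute is
                # constant while total parameters grow with num_experts
                out[mask] = top_p[mask, None] * expert(x[mask])
        return out

moe = TopOneMoE(d_model=16, d_ff=64, num_experts=4)          # sizes are illustrative
y = moe(torch.randn(8, 16))                                  # 8 tokens in, 8 tokens out
```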
DeepSpeed-MoE: Advancing Mixture-of-Experts Inference and Training to Power Next-Generation AI Scale
TLDR
DeepSpeed-MoE is presented, an end-to-end MoE training and inference solution as part of the DeepSpeed library, including novel MoE architecture designs and model compression techniques that reduce MoE model size by up to 3.7x, and a highly optimized inference system that provides 7.3x better latency and cost compared to existing MoE inference solutions.
ST-MoE: Designing Stable and Transferable Sparse Expert Models
TLDR
This work concludes by scaling a sparse model to 269B parameters, with a computational cost comparable to a 32B dense encoder-decoder Transformer (Stable and Transferable Mixture-of-Experts, or ST-MoE-32B), and achieves state-of-the-art performance in transfer learning.
SE-MoE: A Scalable and Efficient Mixture-of-Experts Distributed Training and Inference System
TLDR
SE-MoE is presented, which proposes Elastic MoE training with 2D prefetch and Fusion communication over Hierarchical storage, enabling efficient parallelism for various types of models and scalable inference on a single node.
Tutel: Adaptive Mixture-of-Experts at Scale
TLDR
TUTEL efficiently and effectively runs a real-world MoE-based model named SwinV2-MoE, built upon Swin Transformer V2, a state-of-the-art computer vision architecture, and achieves superior accuracy in both pre-training and downstream computer vision tasks compared to the counterpart dense model, indicating the readiness of TUTEL for end-to-end real-world model training and inference.
Designing Effective Sparse Expert Models
TLDR
This work concludes by scaling a sparse model to 269B parameters, with a computational cost comparable to a 32B dense encoder-decoder Transformer (Stable and Transferable Mixture-of-Experts, or ST-MoE-32B), and achieves state-of-the-art performance in transfer learning.
Uni-Perceiver-MoE: Learning Sparse Generalist Models with Conditional MoEs
TLDR
By incorporating the proposed Conditional MoEs, the recently proposed generalist model Uni-Perceiver can effectively mitigate the interference across tasks and modalities, and achieves state-of-the-art results on a series of downstream tasks via prompt tuning on 1% of downstream data.
Alexa Teacher Model: Pretraining and Distilling Multi-Billion-Parameter Encoders for Natural Language Understanding Systems
TLDR
Results from a large-scale experiment on pretraining encoders with non-embedding parameter counts ranging from 700M to 9.3B, their subsequent distillation into smaller models, and their application to the Natural Language Understanding (NLU) component of a virtual assistant system are presented.
Exploiting Adapters for Cross-Lingual Low-Resource Speech Recognition
TLDR
This paper proposes two novel algorithms, MetaAdapter and SimAdapter, for explicitly learning knowledge from adapters for parameter-efficient cross-lingual speech adaptation, and shows that the two algorithms can be integrated for better performance, with up to a 3.55% relative WER reduction.
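For context on the building block such adapter-based methods attach to a frozen backbone, here is a hedged sketch of a generic bottleneck adapter in PyTorch (down-project, nonlinearity, up-project, residual). The bottleneck width is illustrative, and this is the standard adapter form rather than the SimAdapter or MetaAdapter algorithm itself.

```python
# Generic bottleneck adapter: a small residual module inserted into a frozen backbone.
# Illustrative sketch only; not the SimAdapter/MetaAdapter algorithms themselves.
import torch
import torch.nn as nn

class Adapter(nn.Module):
    def __init__(self, d_model: int, bottleneck: int = 32):
        super().__init__()
        self.down = nn.Linear(d_model, bottleneck)   # project to a small bottleneck
        self.act = nn.GELU()
        self.up = nn.Linear(bottleneck, d_model)     # project back to model width

    def forward(self, h):
        # residual connection keeps the pretrained backbone's behavior when the
        # adapter is near-zero, so only few parameters are trained per language
        return h + self.up(self.act(self.down(h)))

h = torch.randn(4, 10, 256)                          # (batch, time, hidden)
print(Adapter(256)(h).shape)                         # torch.Size([4, 10, 256])
```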
Gating Dropout: Communication-efficient Regularization for Sparsely Activated Transformers
TLDR
Gating Dropout is proposed, which allows tokens to ignore the gating network and stay on their local machines, thus reducing cross-machine communication; it also has a regularization effect during training, resulting in improved generalization performance.
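A minimal sketch of the routing decision described above, assuming PyTorch: with some probability a token ignores the router's choice and is assigned to an expert hosted on its own machine, so the expensive cross-machine all-to-all is skipped for that token. The dropout probability, tensor shapes, and the single-local-expert choice are illustrative assumptions rather than the paper's code.

```python
# Sketch of a gating-dropout-style routing decision (illustrative, not the paper's code).
import torch

def gating_dropout_route(router_choice, local_expert_id, p=0.2, training=True):
    """router_choice: (tokens,) expert ids picked by the gate;
    local_expert_id: id of an expert hosted on this machine."""
    if not training or p == 0.0:
        return router_choice
    stay_local = torch.rand_like(router_choice, dtype=torch.float) < p   # drop the gate
    local = torch.full_like(router_choice, local_expert_id)              # keep token local
    return torch.where(stay_local, local, router_choice)

choice = torch.randint(0, 8, (16,))                 # gate picked one of 8 experts per token
print(gating_dropout_route(choice, local_expert_id=3))
```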
Taming Sparsely Activated Transformer with Stochastic Experts
TLDR
THOR models are trained with a consistency-regularized loss, where experts learn not only from training data but also from other experts as teachers, so that all experts make consistent predictions; THOR significantly outperforms Transformer and MoE models across various settings.
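A hedged sketch of such a consistency-regularized objective: cross-entropy for two randomly selected experts plus a symmetric KL term that encourages their predictions to agree. The weighting coefficient alpha and the exact form of the divergence are illustrative assumptions, not the paper's exact loss.

```python
# Consistency-regularized loss in the spirit of stochastic experts (illustrative).
import torch
import torch.nn.functional as F

def consistency_loss(logits_a, logits_b, labels, alpha=1.0):
    # supervised terms: each randomly chosen expert learns from the training data
    ce = F.cross_entropy(logits_a, labels) + F.cross_entropy(logits_b, labels)
    # consistency term: each expert also learns from the other as a teacher
    pa, pb = F.log_softmax(logits_a, dim=-1), F.log_softmax(logits_b, dim=-1)
    kl = F.kl_div(pa, pb, log_target=True, reduction="batchmean") \
       + F.kl_div(pb, pa, log_target=True, reduction="batchmean")
    return ce + alpha * kl

labels = torch.randint(0, 10, (32,))
loss = consistency_loss(torch.randn(32, 10), torch.randn(32, 10), labels)
```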
...

References

Showing 1-10 of 29 references
Multi-task Learning for Multilingual Neural Machine Translation
TLDR
This work proposes a multi-task learning (MTL) framework that jointly trains the model with the translation task on bitext data and two denoising tasks on the monolingual data, and shows the effectiveness of MTL over pre-training approaches for both NMT and cross-lingual transfer learning NLU tasks.
Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity
TLDR
This work simplifies the MoE routing algorithm and designs intuitive, improved models with reduced communication and computational costs, advances the current scale of language models by pre-training models with up to a trillion parameters on the “Colossal Clean Crawled Corpus”, and achieves a 4x speedup over the T5-XXL model.
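The simplified routing the summary refers to can be written compactly. With N experts and router weights W_r, each token x is sent to the single highest-probability expert, and an auxiliary loss encourages balanced load (notation follows the Switch Transformer description; alpha is a small tunable coefficient):

\[ p_i(x) = \frac{e^{(W_r x)_i}}{\sum_{j=1}^{N} e^{(W_r x)_j}}, \qquad y = p_{i^*}(x)\, E_{i^*}(x), \quad i^* = \arg\max_i p_i(x) \]

\[ \mathcal{L}_{\text{aux}} = \alpha\, N \sum_{i=1}^{N} f_i\, P_i \]

where f_i is the fraction of tokens dispatched to expert i and P_i is the average router probability assigned to expert i over a batch.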
ZeRO: Memory optimizations Toward Training Trillion Parameter Models
TLDR
ZeRO eliminates memory redundancies in data- and model-parallel training while retaining low communication volume and high computational granularity, allowing us to scale the model size proportional to the number of devices with sustained high efficiency.
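As a rough worked example of the redundancy being removed (following the mixed-precision Adam accounting in the ZeRO paper, with Ψ model parameters and N_d data-parallel devices): plain data parallelism replicates fp16 parameters, fp16 gradients, and fp32 optimizer states on every GPU, while ZeRO stage 3 partitions all three across the devices.

\[ \underbrace{2\Psi}_{\text{fp16 params}} + \underbrace{2\Psi}_{\text{fp16 grads}} + \underbrace{12\Psi}_{\text{Adam states}} = 16\Psi \ \text{bytes per GPU} \quad\longrightarrow\quad \approx \frac{16\Psi}{N_d} \ \text{bytes per GPU under ZeRO-3.} \]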
Efficient Large-Scale Language Model Training on GPU Clusters
TLDR
This work shows how to compose different types of parallelism methods (tensor, pipeline, and data parallelism) to scale to thousands of GPUs, achieving a two-order-of-magnitude increase in the sizes of models the authors can efficiently train compared to existing systems.
Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer
TLDR
This systematic study compares pre-training objectives, architectures, unlabeled datasets, transfer approaches, and other factors on dozens of language understanding tasks and achieves state-of-the-art results on many benchmarks covering summarization, question answering, text classification, and more.
Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism
TLDR
A simple, efficient intra-layer model-parallel approach is presented that enables training transformer models with billions of parameters; the work also shows that careful attention to the placement of layer normalization in BERT-like models is critical to achieving increased performance as the model size grows.
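The intra-layer split is easy to verify numerically. Below is a small NumPy sketch of a Megatron-style 2-way MLP partition (first weight matrix split by columns, second by rows); the shapes and the GELU approximation are illustrative, and the plain sum stands in for the single all-reduce a real distributed implementation would perform.

```python
# Numerical check of tensor (intra-layer) parallelism for a 2-way split MLP.
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 8))                 # (tokens, hidden)
A = rng.standard_normal((8, 16))                # first MLP weight
B = rng.standard_normal((16, 8))                # second MLP weight
gelu = lambda z: 0.5 * z * (1 + np.tanh(np.sqrt(2 / np.pi) * (z + 0.044715 * z**3)))

full = gelu(x @ A) @ B                          # unpartitioned reference

A1, A2 = A[:, :8], A[:, 8:]                     # column-parallel split of A
B1, B2 = B[:8, :], B[8:, :]                     # row-parallel split of B
partial = gelu(x @ A1) @ B1 + gelu(x @ A2) @ B2 # "+" stands in for the all-reduce

print(np.allclose(full, partial))               # True: the split is exact
```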
Attention is All you Need
TLDR
A new, simple network architecture, the Transformer, based solely on attention mechanisms and dispensing with recurrence and convolutions entirely, is proposed; it generalizes well to other tasks, as shown by applying it successfully to English constituency parsing with both large and limited training data.
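For reference, the core operation the Transformer is built on is scaled dot-product attention over queries Q, keys K, and values V with key dimension d_k:

\[ \mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right) V \]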
Multilingual Denoising Pre-training for Neural Machine Translation
Abstract
This paper demonstrates that multilingual denoising pre-training produces significant performance gains across a wide variety of machine translation (MT) tasks. We present mBART, a sequence-to-sequence denoising auto-encoder pre-trained on large-scale monolingual corpora in many languages using the BART objective.
GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding
TLDR
GShard enabled scaling up a multilingual neural machine translation Transformer model with Sparsely-Gated Mixture-of-Experts beyond 600 billion parameters using automatic sharding, and it is demonstrated that such a giant model can be trained efficiently on 2048 TPU v3 accelerators in 4 days to achieve far superior quality for translation from 100 languages to English compared to the prior art.
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
TLDR
A new language representation model, BERT, designed to pre-train deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers, which can be fine-tuned with just one additional output layer to create state-of-the-art models for a wide range of tasks.
...