Corpus ID: 248496391

ST-MoE: Designing Stable and Transferable Sparse Expert Models

@article{zoph2022stmoe,
  title={ST-MoE: Designing Stable and Transferable Sparse Expert Models},
  author={Barret Zoph and Irwan Bello and Sameer Kumar and Nan Du and Yanping Huang and Jeff Dean and Noam M. Shazeer and William Fedus},
  journal={arXiv preprint arXiv:2202.08906},
  year={2022}
}
Scale has opened new frontiers in natural language processing, but at a high cost. In response, Mixture-of-Experts (MoE) and Switch Transformers have been proposed as an energy-efficient path to even larger and more capable language models. However, advancing the state of the art across a broad set of natural language tasks has been hindered by training instabilities and uncertain quality during fine-tuning. Our work focuses on these issues and acts as a design guide. We conclude by scaling a…
Multimodal Contrastive Learning with LIMoE: the Language-Image Mixture of Experts
This work presents the Language-Image MoE, LIMoE, a sparse mixture-of-experts model capable of multimodal learning, and proposes an entropy-based regularization scheme that yields remarkable performance improvements over dense models of equivalent computational cost.
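The entropy-based idea can be sketched as two opposing terms on the router's probabilities: each example's routing distribution should be confident (low entropy), while the batch-averaged distribution should stay balanced across experts (high entropy). A minimal NumPy sketch, with illustrative function names that are not LIMoE's actual API:

```python
import numpy as np

def entropy(p, eps=1e-9):
    """Shannon entropy along the last axis; eps guards log(0)."""
    return -np.sum(p * np.log(p + eps), axis=-1)

def entropy_aux_loss(router_probs):
    """Hedged sketch of an entropy-based router regularizer.

    router_probs: (batch, n_experts) softmax outputs of the router.
    Minimizing this loss pushes each example toward a confident
    (low-entropy) routing while keeping the batch-averaged expert
    usage balanced (high-entropy mean distribution).
    """
    local = entropy(router_probs).mean()          # want small: confident per-example
    global_ = entropy(router_probs.mean(axis=0))  # want large: balanced expert usage
    return local - global_
```

With perfectly balanced one-hot routing the loss approaches its minimum of -log(n_experts); with uniform routing the two terms cancel and the loss is near zero.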


DSelect-k: Differentiable Selection in the Mixture of Experts with Applications to Multi-Task Learning
DSelect-k is developed: the first continuously differentiable and sparse gate for MoE, based on a novel binary encoding formulation, which can be trained with first-order methods such as stochastic gradient descent and offers explicit control over the number of experts selected.
Beyond Distillation: Task-level Mixture-of-Experts for Efficient Inference
This work investigates routing strategies at different granularities (token, sentence, task) in MoE models to bypass distillation, and suggests that task-level routing (task-MoE) enables extracting smaller, ready-to-deploy sub-networks from large sparse models.
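The task-level routing idea is simple enough to sketch: because every token of a given task is routed to the same pre-assigned experts, the experts a task never uses can be dropped at deployment time. The mapping table and helper names below are illustrative assumptions, not the paper's actual API:

```python
def route_for_task(task_id, task_to_experts):
    """Task-level routing sketch: all tokens in a sequence for `task_id`
    are sent to the same fixed subset of experts."""
    return task_to_experts[task_id]

def extract_subnetwork(experts, task_id, task_to_experts):
    """Keep only the experts a task routes to, yielding a smaller,
    ready-to-deploy sub-network of the large sparse model."""
    return [experts[i] for i in task_to_experts[task_id]]
```

This contrasts with token-level routing, where the expert choice depends on each token and no expert can be pruned ahead of time.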
Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer
This systematic study compares pre-training objectives, architectures, unlabeled datasets, transfer approaches, and other factors on dozens of language understanding tasks and achieves state-of-the-art results on many benchmarks covering summarization, question answering, text classification, and more.
GLaM: Efficient Scaling of Language Models with Mixture-of-Experts
This paper proposes and develops a family of language models named GLaM, which uses a sparsely activated mixture-of-experts architecture to scale the model capacity while also incurring substantially less training cost compared to dense variants.
Scalable and Efficient MoE Training for Multitask Multilingual Models
A system is developed that scales MoE models efficiently to trillions of parameters by combining multidimensional parallelism and heterogeneous memory technologies with MoE, enabling 8x larger models on the same hardware compared with existing work.
Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer
This work introduces a Sparsely-Gated Mixture-of-Experts layer (MoE), consisting of up to thousands of feed-forward sub-networks, and applies the MoE to the tasks of language modeling and machine translation, where model capacity is critical for absorbing the vast quantities of knowledge available in the training corpora.
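The core mechanism of the sparsely-gated layer is top-k gating: a learned router scores every expert per token, only the k highest-scoring experts run, and their outputs are combined with softmax weights over just those k. A minimal NumPy sketch of this idea, with illustrative shapes and without the noise term or load-balancing losses used in practice:

```python
import numpy as np

def top_k_gating(x, w_gate, k=2):
    """Sketch of top-k gating: sparse mixture weights over experts.

    x: (d,) token representation; w_gate: (d, n_experts) router weights.
    Only the k largest-logit experts receive nonzero weight.
    """
    logits = x @ w_gate                       # (n_experts,) router scores
    top = np.argsort(logits)[-k:]             # indices of the k best experts
    masked = np.full_like(logits, -np.inf)
    masked[top] = logits[top]                 # suppress non-selected experts
    exp = np.exp(masked - logits[top].max())  # stable softmax over the k kept
    return exp / exp.sum()                    # zeros outside the selection

def moe_layer(x, w_gate, experts, k=2):
    """Combine only the selected experts' outputs, weighted by the gates."""
    gates = top_k_gating(x, w_gate, k)
    return sum(g * f(x) for g, f in zip(gates, experts) if g > 0)
```

Because only k of the experts execute per token, capacity (total parameters) grows with the number of experts while per-token compute stays roughly constant.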
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
A new language representation model, BERT, designed to pre-train deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers, which can be fine-tuned with just one additional output layer to create state-of-the-art models for a wide range of tasks.
Language Models are Few-Shot Learners
GPT-3 achieves strong performance on many NLP datasets, including translation, question-answering, and cloze tasks, as well as several tasks that require on-the-fly reasoning or domain adaptation, such as unscrambling words, using a novel word in a sentence, or performing 3-digit arithmetic.
Attention is All you Need
A new simple network architecture, the Transformer, based solely on attention mechanisms and dispensing with recurrence and convolutions entirely, is proposed; it generalizes well to other tasks, as shown by its successful application to English constituency parsing with both large and limited training data.
ERNIE 3.0 Titan: Exploring Larger-scale Knowledge Enhanced Pre-training for Language Understanding and Generation
A knowledge-enhanced pre-trained model, ERNIE 3.0 Titan, with up to 260 billion parameters is trained; it is the largest Chinese dense pre-trained model so far and outperforms state-of-the-art models on 68 NLP datasets.