• Corpus ID: 248496391

ST-MoE: Designing Stable and Transferable Sparse Expert Models

Authors: Barret Zoph, Irwan Bello, Sameer Kumar, Nan Du, Yanping Huang, Jeff Dean, Noam M. Shazeer, William Fedus
Scale has opened new frontiers in natural language processing, but at a high cost. In response, Mixture-of-Experts (MoE) and Switch Transformers have been proposed as an energy-efficient path to even larger and more capable language models. But advancing the state of the art across a broad set of natural language tasks has been hindered by training instabilities and uncertain quality during fine-tuning. Our work focuses on these issues and acts as a design guide. We conclude by scaling a…

Multimodal Contrastive Learning with LIMoE: the Language-Image Mixture of Experts

This work presents the Language-Image MoE, LIMoE, a sparse mixture-of-experts model capable of multimodal learning, proposes an entropy-based regularization scheme, and demonstrates remarkable performance improvement over dense models of equivalent computational cost.



DSelect-k: Differentiable Selection in the Mixture of Experts with Applications to Multi-Task Learning

DSelect-k is developed: the first continuously differentiable and sparse gate for MoE, based on a novel binary encoding formulation. It can be trained using first-order methods, such as stochastic gradient descent, and offers explicit control over the number of experts to select.

Beyond Distillation: Task-level Mixture-of-Experts for Efficient Inference

This work investigates routing strategies at different granularity (token, sentence, task) in MoE models to bypass distillation and suggests that task-level routing (task-MoE) enables us to extract smaller, ready-to-deploy sub-networks from large sparse models.

Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer

This systematic study compares pre-training objectives, architectures, unlabeled datasets, transfer approaches, and other factors on dozens of language understanding tasks and achieves state-of-the-art results on many benchmarks covering summarization, question answering, text classification, and more.

GLaM: Efficient Scaling of Language Models with Mixture-of-Experts

This paper proposes and develops a family of language models named GLaM, which uses a sparsely activated mixture-of-experts architecture to scale the model capacity while also incurring substantially less training cost compared to dense variants.

Scalable and Efficient MoE Training for Multitask Multilingual Models

A system capable of scaling MoE models efficiently to trillions of parameters is developed that combines multidimensional parallelism and heterogeneous memory technologies harmoniously with MoE to empower 8x larger models on the same hardware compared with existing work.

Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer

This work introduces a Sparsely-Gated Mixture-of-Experts layer (MoE), consisting of up to thousands of feed-forward sub-networks, and applies the MoE to the tasks of language modeling and machine translation, where model capacity is critical for absorbing the vast quantities of knowledge available in the training corpora.
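
As a rough illustration of the kind of sparsely-gated layer summarized above, the sketch below routes one input vector to the k experts with the highest gate logits and combines their outputs with softmax weights. It is a simplified sketch under stated assumptions, not the paper's implementation: the noise term in the gate and the auxiliary load-balancing losses are omitted, experts are assumed to return vectors of the same length as the input, and the names `top_k_gate`, `moe_layer`, `matvec`, and `make_expert` are hypothetical.

```python
import math


def top_k_gate(logits, k):
    """Sparse gate weights: softmax over the k largest logits, 0 elsewhere."""
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    m = max(logits[i] for i in top)  # subtract the max for numerical stability
    exps = {i: math.exp(logits[i] - m) for i in top}
    z = sum(exps.values())
    return [exps[i] / z if i in exps else 0.0 for i in range(len(logits))]


def matvec(w, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(wij * xj for wij, xj in zip(row, x)) for row in w]


def moe_layer(x, gate_w, experts, k=2):
    """x: input vector; gate_w: one row of gate weights per expert;
    experts: list of callables mapping a vector to a same-length vector."""
    gates = top_k_gate(matvec(gate_w, x), k)
    out = [0.0] * len(x)
    for g, expert in zip(gates, experts):
        if g == 0.0:
            continue  # experts with zero gate weight are never evaluated
        y = expert(x)
        out = [o + g * yi for o, yi in zip(out, y)]
    return out, gates


def make_expert(scale):
    """Toy stand-in for a feed-forward expert: scales its input."""
    return lambda x: [scale * xi for xi in x]


# Four toy experts; only two are evaluated for this input.
experts = [make_expert(s) for s in (1.0, 2.0, 3.0, 4.0)]
gate_w = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]]
y, gates = moe_layer([2.0, 1.0], gate_w, experts, k=2)
```

Because only k experts run per input, the compute per example stays roughly constant as more experts are added, which is the property the summary refers to when it describes scaling to thousands of sub-networks.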

BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

BERT is a new language representation model designed to pre-train deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers. It can be fine-tuned with just one additional output layer to create state-of-the-art models for a wide range of tasks.

Attention is All you Need

A new simple network architecture, the Transformer, is proposed. It is based solely on attention mechanisms, dispensing with recurrence and convolutions entirely, and it generalizes well to other tasks, as shown by applying it successfully to English constituency parsing with both large and limited training data.

Parameter-Efficient Transfer Learning for NLP

To demonstrate the effectiveness of adapters, the recently proposed BERT Transformer model is transferred to 26 diverse text classification tasks, including the GLUE benchmark. Adapters attain near state-of-the-art performance while adding only a few parameters per task.

PEGASUS: Pre-training with Extracted Gap-sentences for Abstractive Summarization

This work proposes pre-training large Transformer-based encoder-decoder models on massive text corpora with a new self-supervised objective, PEGASUS, and demonstrates it achieves state-of-the-art performance on all 12 downstream datasets measured by ROUGE scores.