Corpus ID: 245334345

Efficient Large Scale Language Modeling with Mixtures of Experts

@article{Artetxe2021EfficientLS,
  title={Efficient Large Scale Language Modeling with Mixtures of Experts},
  author={Mikel Artetxe and Shruti Bhosale and Naman Goyal and Todor Mihaylov and Myle Ott and Sam Shleifer and Xi Victoria Lin and Jingfei Du and Srini Iyer and Ramakanth Pasunuru and Giridhar Anantharaman and Xian Li and Shuohui Chen and Halil Akın and Mandeep Baines and Louis Martin and Xing Zhou and Punit Singh Koura and Brian O'Horo and Jeff Wang and Luke Zettlemoyer and Mona Diab and Zornitsa Kozareva and Ves Stoyanov},
  journal={ArXiv},
  year={2021},
  volume={abs/2112.10684}
}
Mixture of Experts layers (MoEs) enable efficient scaling of language models through conditional computation. This paper presents a detailed empirical study of how autoregressive MoE language models scale in comparison with dense models in a wide range of settings: in- and out-of-domain language modeling, zero- and few-shot priming, and full fine-tuning. With the exception of fine-tuning, we find MoEs to be substantially more compute efficient. At more modest training budgets, MoEs can match the… 
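As a rough illustration of the conditional computation the abstract refers to, below is a minimal sketch of a top-2 gated mixture-of-experts feed-forward block in PyTorch. This is not the paper's implementation: the MoELayer class name, the layer sizes, and the softmax-then-top-k routing scheme are illustrative assumptions, and the load-balancing losses and expert parallelism used in practice are omitted.

```python
# Minimal sketch of a mixture-of-experts (MoE) feed-forward layer with top-2
# gating. Each token is processed by only its top_k experts, so per-token
# compute stays roughly constant as the number of experts (and parameters) grows.
# Illustrative assumptions only; not the paper's actual architecture.
import torch
import torch.nn as nn
import torch.nn.functional as F


class MoELayer(nn.Module):
    def __init__(self, d_model: int, d_ff: int, num_experts: int, top_k: int = 2):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(d_model, num_experts)  # token -> expert scores
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
            for _ in range(num_experts)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (tokens, d_model)
        gate_probs = F.softmax(self.router(x), dim=-1)              # (tokens, E)
        weights, expert_ids = gate_probs.topk(self.top_k, dim=-1)   # (tokens, k)
        weights = weights / weights.sum(dim=-1, keepdim=True)       # renormalize over chosen experts
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            for slot in range(self.top_k):
                mask = expert_ids[:, slot] == e
                if mask.any():
                    # Only the tokens routed to expert e are run through it.
                    out[mask] += weights[mask, slot, None] * expert(x[mask])
        return out


if __name__ == "__main__":
    layer = MoELayer(d_model=16, d_ff=64, num_experts=4)
    tokens = torch.randn(8, 16)
    print(layer(tokens).shape)  # torch.Size([8, 16])
```

The loop over experts keeps the sketch readable; production MoE implementations instead batch tokens per expert and dispatch them across devices.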
METRO: Efficient Denoising Pretraining of Large Scale Autoencoding Language Models with Model Generated Signals
TLDR
This work conducts a comprehensive empirical study, and proposes a recipe, namely “Model generated dEnoising TRaining Objective” (METRO), which incorporates some of the best modeling techniques developed recently to speed up, stabilize, and enhance pretrained language models without compromising model effectiveness.
On the Representation Collapse of Sparse Mixture of Experts
TLDR
This work proposes to estimate the routing scores between tokens and experts on a low-dimensional hypersphere and achieves more consistent routing than the baseline mixture-of-experts methods.
Training Compute-Optimal Large Language Models
TLDR
This paper trains a predicted compute-optimal model, Chinchilla, that uses the same compute budget as Gopher but with 70B parameters and 4× more data, and reaches a state-of-the-art average accuracy on the MMLU benchmark.
DeepSpeed-MoE: Advancing Mixture-of-Experts Inference and Training to Power Next-Generation AI Scale
TLDR
DeepSpeed-MoE is presented, an end-to-end MoE training and inference solution as part of the DeepSpeed library, including novel MoE architecture designs and model compression techniques that reduce MoE model size by up to 3.7x, and a highly optimized inference system that provides 7.3x better latency and cost compared to existing MoE inference solutions.
Unified Scaling Laws for Routed Language Models
TLDR
This work derives and justifies scaling laws defined on parameter count and computational requirement which generalize those known for standard language models and describe the performance of a wide range of routing architectures trained via three different techniques.
A Survey on Dynamic Neural Networks for Natural Language Processing
TLDR
This survey summarizes progress on three types of dynamic neural networks in NLP: skimming, mixture of experts, and early exit, and highlights current challenges in dynamic neural networks and directions for future research.
Improving In-Context Few-Shot Learning via Self-Supervised Training
TLDR
This paper proposes to use self-supervision in an intermediate training stage between pretraining and downstream few-shot usage, with the goal of teaching the model to perform in-context few-shot learning.
Autoregressive Search Engines: Generating Substrings as Document Identifiers
TLDR
This work proposes an alternative that doesn't force any structure in the search space: using all n-grams in a passage as its possible identifiers, which not only outperforms prior autoregressive approaches but also leads to an average improvement over more established retrieval solutions for passage-level retrieval on the KILT benchmark.
One Student Knows All Experts Know: From Sparse to Dense
TLDR
This work proposes a novel task, knowledge integration, to obtain a dense student model (OneS) as knowledgeable as one sparse MoE, and proposes Singular Value Decomposition Knowledge Gathering (SVD-KG) to gather key knowledge from different pretrained experts.
GPT-NeoX-20B: An Open-Source Autoregressive Language Model
TLDR
GPT-NeoX-20B's performance is evaluated on a range of language-understanding, mathematics, and knowledge-based tasks; it is found to be a particularly powerful few-shot reasoner that gains far more from few-shot evaluation than similarly sized GPT-3 and FairSeq models.

References

Showing 1-10 of 78 references
DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter
TLDR
This work proposes a method to pre-train a smaller general-purpose language representation model, called DistilBERT, which can be fine-tuned with good performance on a wide range of tasks like its larger counterparts, and introduces a triple loss combining language modeling, distillation and cosine-distance losses.
ALBERT: A Lite BERT for Self-supervised Learning of Language Representations
TLDR
This work presents two parameter-reduction techniques to lower memory consumption and increase the training speed of BERT, and uses a self-supervised loss that focuses on modeling inter-sentence coherence.
Language Models are Unsupervised Multitask Learners
TLDR
It is demonstrated that language models begin to learn these tasks without any explicit supervision when trained on a new dataset of millions of webpages called WebText, suggesting a promising path towards building language processing systems which learn to perform tasks from their naturally occurring demonstrations.
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
TLDR
A new language representation model, BERT, designed to pre-train deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers, which can be fine-tuned with just one additional output layer to create state-of-the-art models for a wide range of tasks.
Adaptive Input Representations for Neural Language Modeling
TLDR
Adaptive input representations for neural language modeling, which extend the adaptive softmax of Grave et al. (2017) to input representations of variable capacity, are introduced, and a systematic comparison of popular choices for a self-attentional architecture is performed.
Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity
TLDR
The proposed training techniques mitigate the instabilities, and it is shown that large sparse models may be trained, for the first time, with lower precision (bfloat16) formats and achieve a 4x speedup over the T5-XXL model.
RoBERTa: A Robustly Optimized BERT Pretraining Approach
TLDR
It is found that BERT was significantly undertrained, and can match or exceed the performance of every model published after it, and the best model achieves state-of-the-art results on GLUE, RACE and SQuAD.
The Pile: An 800GB Dataset of Diverse Text for Language Modeling
TLDR
This work presents the Pile, an 825 GiB English text corpus targeted at training large-scale language models, constructed from 22 diverse high-quality subsets, both existing and newly constructed, many of which derive from academic or professional sources.
Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism
TLDR
A simple, efficient intra-layer model parallel approach that enables training transformer models with billions of parameters and shows that careful attention to the placement of layer normalization in BERT-like models is critical to achieving increased performance as the model size grows.
BASE Layers: Simplifying Training of Large, Sparse Models
TLDR
A new balanced assignment of experts (BASE) layer for large language models that greatly simplifies existing high capacity sparse layers and improves efficiency by guaranteeing balanced compute loads, and also simplifies training by not requiring any new hyperparameters or auxiliary losses.