Corpus ID: 231573431

Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity

@article{Fedus2021SwitchTS,
  title={Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity},
  author={William Fedus and Barret Zoph and Noam M. Shazeer},
  journal={ArXiv},
  year={2021},
  volume={abs/2101.03961}
}
In deep learning, models typically reuse the same parameters for all inputs. Mixture of Experts (MoE) models defy this and instead select different parameters for each incoming example. The result is a sparsely activated model, with an outrageous number of parameters but a constant computational cost. However, despite several notable successes of MoE, widespread adoption has been hindered by complexity, communication costs, and training instability. We address these with the introduction of the Switch Transformer.
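To make the routing idea concrete, here is a minimal NumPy sketch of top-1 ("switch"-style) routing under illustrative assumptions: the shapes, weight names, and softmax-scaled combine step are chosen for clarity and are not the paper's implementation, which also uses an auxiliary load-balancing loss and expert capacity limits.

```python
# Minimal NumPy sketch of top-1 ("switch") routing: each token is sent to a
# single expert chosen by a learned router, so parameter count grows with the
# number of experts while per-token compute stays roughly constant.
# Shapes and names (d_model, num_experts, ...) are illustrative only.
import numpy as np

rng = np.random.default_rng(0)
num_tokens, d_model, d_ff, num_experts = 8, 16, 32, 4

tokens = rng.standard_normal((num_tokens, d_model))
router_w = rng.standard_normal((d_model, num_experts)) * 0.02
# One feed-forward expert = two dense layers; only the chosen expert runs per token.
w_in = rng.standard_normal((num_experts, d_model, d_ff)) * 0.02
w_out = rng.standard_normal((num_experts, d_ff, d_model)) * 0.02

logits = tokens @ router_w                            # (num_tokens, num_experts)
probs = np.exp(logits - logits.max(-1, keepdims=True))
probs = probs / probs.sum(-1, keepdims=True)
expert_idx = probs.argmax(-1)                         # top-1 expert per token

out = np.zeros_like(tokens)
for e in range(num_experts):
    sel = expert_idx == e
    if sel.any():
        h = np.maximum(tokens[sel] @ w_in[e], 0.0)    # ReLU FFN inside expert e
        # Scale by the router probability so the gate stays differentiable.
        out[sel] = (h @ w_out[e]) * probs[sel, e:e + 1]
print(out.shape)  # (8, 16)
```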
FasterMoE: modeling and optimizing training of large-scale dynamic pre-trained models
TLDR
This work designs a congestion-avoiding expert selection strategy that relieves network congestion and lowers iteration latency when modification of the expert selection is allowed, and implements and integrates these optimizations as a general system, FasterMoE, enabling efficient distributed MoE model training.
A Practical Survey on Faster and Lighter Transformers
TLDR
This survey investigates popular approaches to make the Transformer faster and lighter and provides a comprehensive explanation of the methods' strengths, limitations, and underlying assumptions to meet the desired trade-off between capacity, computation, and memory.
Sparse is Enough in Scaling Transformers
TLDR
This work proposes Scaling Transformers, a family of next-generation Transformer models that use sparse layers to scale efficiently and to perform unbatched decoding much faster than the standard Transformer as model size grows.
SpeechMoE: Scaling to Large Acoustic Models with Dynamic Routing Mixture of Experts
TLDR
SpeechMoE, an MoE-based model for speech recognition, achieves a lower character error rate (CER) than traditional static networks at comparable computational cost; a new router architecture is used that simultaneously utilizes information from a shared embedding network and the hierarchical representations of different MoE layers.
M6-10T: A Sharing-Delinking Paradigm for Efficient Multi-Trillion Parameter Pretraining
TLDR
This paper demonstrates a practice of pretraining an unprecedented 10-trillion-parameter model, an order of magnitude larger than the state of the art, on only 512 GPUs within 10 days, and provides a technique, granular CPU offloading, to manage CPU memory for training large models while maintaining high GPU utilization.
MoEfication: Conditional Computation of Transformer Models for Efficient Inference
TLDR
This work proposes transforming a large model into a mixture-of-experts (MoE) version of equal model size, namely MoEfication, to accelerate large-model inference through conditional computation based on the sparse-activation phenomenon.
DSelect-k: Differentiable Selection in the Mixture of Experts with Applications to Multi-Task Learning
TLDR
This work develops DSelect-k: the first continuously differentiable and sparse gate for MoE, based on a novel binary encoding formulation, which can be trained with first-order methods such as stochastic gradient descent and offers explicit control over the number of experts to select.
M6-T: Exploring Sparse Expert Models and Beyond
TLDR
This work investigates several key factors in sparse expert models and proposes a simple method called expert prototyping that improves model quality while maintaining constant computational cost; further exploration of extremely large-scale models shows the approach is even more effective when training larger models.
Prune Once for All: Sparse Pre-Trained Language Models
TLDR
This work presents a new method for training sparse pre-trained Transformer language models by integrating weight pruning and model distillation, and shows the best compression-to-accuracy ratio for BERT-Base, BERT-Large, and DistilBERT.
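As a sketch of the pruning half of that recipe, the snippet below applies unstructured magnitude pruning under illustrative settings; the distillation component and the paper's actual pruning schedule are not reproduced here.

```python
# Hedged sketch of unstructured magnitude pruning: zero out the smallest-magnitude
# weights and keep a binary mask that can be held fixed during further training.
import numpy as np

def magnitude_prune(weights: np.ndarray, sparsity: float):
    """Return pruned weights and the mask keeping the largest-magnitude entries."""
    threshold = np.quantile(np.abs(weights), sparsity)
    mask = np.abs(weights) >= threshold
    return weights * mask, mask

rng = np.random.default_rng(0)
w = rng.standard_normal((256, 256))
w_pruned, mask = magnitude_prune(w, sparsity=0.9)
print(f"{1 - mask.mean():.2%} of weights set to zero")   # ~90.00%
```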

References

SHOWING 1-10 OF 63 REFERENCES
Generating Long Sequences with Sparse Transformers
TLDR
This paper introduces sparse factorizations of the attention matrix which reduce its quadratic cost to $O(n \sqrt{n})$, generates unconditional samples that demonstrate global coherence and great diversity, and shows it is possible in principle to use self-attention to model sequences of length one million or more.
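As a rough illustration of the kind of factorized sparsity described above, the NumPy sketch below builds a strided causal attention mask; the function and its parameters are hypothetical, and the paper's two-head factorization differs in detail.

```python
# Illustrative strided sparse attention mask in the spirit of factorized attention:
# each query attends to a local window plus every stride-th "summary" position,
# giving roughly O(n * sqrt(n)) attended pairs when stride ~ sqrt(n).
import numpy as np

def strided_mask(n: int, stride: int) -> np.ndarray:
    mask = np.zeros((n, n), dtype=bool)
    for q in range(n):
        for k in range(q + 1):                       # causal: keys <= query
            local = q - k < stride                   # recent window
            summary = (k % stride) == (stride - 1)   # strided summary positions
            mask[q, k] = local or summary
    return mask

m = strided_mask(n=16, stride=4)
print(m.sum(), "attended pairs vs", 16 * 17 // 2, "for dense causal attention")
```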
Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer
TLDR
This work introduces a Sparsely-Gated Mixture-of-Experts layer (MoE), consisting of up to thousands of feed-forward sub-networks, and applies the MoE to the tasks of language modeling and machine translation, where model capacity is critical for absorbing the vast quantities of knowledge available in the training corpora.
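A hedged sketch of the noisy top-k gating at the heart of such a layer is given below; weight names and shapes are illustrative, and the auxiliary load-balancing losses the layer relies on are omitted.

```python
# Noisy top-k gating sketch: add learned noise to the router logits, keep only the
# top-k entries, and softmax over them so each example activates just k experts.
import numpy as np

def noisy_top_k_gates(x, w_gate, w_noise, k, rng):
    clean = x @ w_gate
    noise_std = np.log1p(np.exp(x @ w_noise))             # softplus
    noisy = clean + rng.standard_normal(clean.shape) * noise_std
    gates = np.full_like(noisy, -np.inf)                  # mask out non-selected experts
    topk = np.argsort(noisy, axis=-1)[:, -k:]             # indices of top-k experts
    np.put_along_axis(gates, topk, np.take_along_axis(noisy, topk, -1), -1)
    gates = np.exp(gates - gates.max(-1, keepdims=True))
    return gates / gates.sum(-1, keepdims=True)           # zeros outside the top-k

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 8))
gates = noisy_top_k_gates(x, rng.standard_normal((8, 16)),
                          rng.standard_normal((8, 16)), k=2, rng=rng)
print((gates > 0).sum(axis=-1))  # exactly 2 non-zero gates per example
```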
Scalable Transfer Learning with Expert Models
TLDR
This work trains a diverse set of experts by exploiting existing label structures, uses cheap-to-compute performance proxies to select the relevant expert for each target task, and provides an adapter-based architecture able to compress many experts into a single model.
Reformer: The Efficient Transformer
TLDR
This work replaces dot-product attention by one that uses locality-sensitive hashing and uses reversible residual layers instead of the standard residuals, which allows storing activations only once in the training process instead of several times, making the model much more memory-efficient and much faster on long sequences.
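The reversible-residual idea can be illustrated in a few lines of NumPy: a block's inputs are recomputed exactly from its outputs instead of being stored. The placeholder functions F and G stand in for the attention and feed-forward sublayers; this is a sketch of the coupling, not the Reformer implementation.

```python
# RevNet-style reversible residual block: activations of the previous layer can be
# reconstructed from the outputs, so they need not be kept in memory for backprop.
import numpy as np

def F(x): return np.tanh(x)          # placeholder for the attention sublayer
def G(x): return np.maximum(x, 0.0)  # placeholder for the feed-forward sublayer

def rev_forward(x1, x2):
    y1 = x1 + F(x2)
    y2 = x2 + G(y1)
    return y1, y2

def rev_inverse(y1, y2):
    x2 = y2 - G(y1)                  # recover inputs without having stored them
    x1 = y1 - F(x2)
    return x1, x2

rng = np.random.default_rng(0)
x1, x2 = rng.standard_normal((2, 4, 8))
y1, y2 = rev_forward(x1, x2)
r1, r2 = rev_inverse(y1, y2)
print(np.allclose(x1, r1) and np.allclose(x2, r2))  # True
```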
Exponentially Increasing the Capacity-to-Computation Ratio for Conditional Computation in Deep Learning
TLDR
A novel parametrization of weight matrices in neural networks is proposed that can increase the ratio of parameters to computation by up to an exponential factor; it is based on turning on some parameters (weight matrices) only when specific bit patterns of hidden unit activations are obtained.
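Below is a purely illustrative sketch of that idea: a bit pattern derived from the hidden activations indexes one of exponentially many candidate weight matrices, so only one matrix is used per input. All names and shapes are hypothetical.

```python
# Conditional-computation sketch: threshold a small projection of the hidden state
# into k bits and use that pattern to select one of 2^k weight matrices, so the
# parameter count grows exponentially in k while per-input compute does not.
import numpy as np

rng = np.random.default_rng(0)
d, k = 16, 3                                          # hidden size, gating bits
gate_w = rng.standard_normal((d, k))
experts = rng.standard_normal((2**k, d, d)) * 0.1     # 2^k candidate weight matrices

def conditional_layer(h):
    bits = (h @ gate_w > 0).astype(int)               # bit pattern from activations
    idx = int(bits @ (2 ** np.arange(k)))             # interpret the pattern as an index
    return np.tanh(h @ experts[idx])                  # only one matrix is multiplied

print(conditional_layer(rng.standard_normal(d)).shape)  # (16,)
```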
DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter
TLDR
This work proposes a method to pre-train a smaller general-purpose language representation model, called DistilBERT, which can be fine-tuned with good performance on a wide range of tasks like its larger counterparts, and introduces a triple loss combining language modeling, distillation, and cosine-distance losses.
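A minimal sketch of such a triple objective is shown below, assuming equal loss weights and an illustrative temperature; it is not DistilBERT's training code.

```python
# Three-part distillation objective sketch: a soft-target (KL) distillation term,
# the usual masked-language-modeling cross-entropy, and a cosine term aligning
# student and teacher hidden states. Weights and temperature are illustrative.
import numpy as np

def softmax(z, t=1.0):
    z = z / t
    z = z - z.max(-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(-1, keepdims=True)

def triple_loss(student_logits, teacher_logits, labels, h_student, h_teacher, t=2.0):
    p_t, p_s = softmax(teacher_logits, t), softmax(student_logits, t)
    l_kd = (p_t * (np.log(p_t + 1e-9) - np.log(p_s + 1e-9))).sum(-1).mean()
    l_mlm = -np.log(softmax(student_logits)[np.arange(len(labels)), labels] + 1e-9).mean()
    cos = (h_student * h_teacher).sum(-1) / (
        np.linalg.norm(h_student, axis=-1) * np.linalg.norm(h_teacher, axis=-1) + 1e-9)
    l_cos = (1.0 - cos).mean()
    return l_kd + l_mlm + l_cos   # equal weights, purely for illustration

rng = np.random.default_rng(0)
n, v, d = 4, 10, 8
print(triple_loss(rng.standard_normal((n, v)), rng.standard_normal((n, v)),
                  rng.integers(0, v, n), rng.standard_normal((n, d)),
                  rng.standard_normal((n, d))))
```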
ZeRO: Memory Optimizations Toward Training Trillion Parameter Models
TLDR
This work develops a novel solution, the Zero Redundancy Optimizer (ZeRO), to optimize memory, achieving both memory efficiency and scaling efficiency, and demonstrates that ZeRO has the potential to scale beyond 1 trillion parameters using today's hardware.
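The memory accounting behind this can be sketched with simple arithmetic, assuming mixed-precision Adam at roughly 16 bytes per parameter and ignoring activations, buffers, and fragmentation; the staged partitioning below mirrors ZeRO's progressive scheme, but the helper itself is hypothetical.

```python
# Back-of-the-envelope ZeRO memory sketch for mixed-precision Adam: 2 bytes of fp16
# weights + 2 bytes of fp16 gradients + 12 bytes of fp32 optimizer state (master
# weights, momentum, variance) per parameter, progressively partitioned across ranks.
def zero_bytes_per_rank(params: float, n_ranks: int, stage: int) -> float:
    p, g, opt = 2 * params, 2 * params, 12 * params
    if stage >= 1: opt /= n_ranks       # partition optimizer states
    if stage >= 2: g /= n_ranks         # also partition gradients
    if stage >= 3: p /= n_ranks         # also partition parameters
    return p + g + opt

for stage in range(4):                  # stage 0 = plain data parallelism
    gib = zero_bytes_per_rank(params=1e9, n_ranks=64, stage=stage) / 2**30
    print(f"1B params, 64 ranks, stage {stage}: ~{gib:.1f} GiB per rank")
```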
Mixed Precision Training
TLDR
This work introduces a technique to train deep neural networks using half-precision floating point numbers, and demonstrates that this approach works for a wide variety of models including convolutional neural networks, recurrent neural networks, and generative adversarial networks.
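A toy NumPy sketch of the core recipe, fp32 master weights, fp16 compute, and a static loss scale with an overflow check, is given below; the tiny regression problem and the chosen constants are illustrative only.

```python
# Mixed-precision sketch: keep fp32 master weights, compute in fp16, and scale the
# loss so small gradients survive the limited fp16 range; skip steps that overflow.
import numpy as np

rng = np.random.default_rng(0)
w_master = rng.standard_normal(4).astype(np.float32)        # fp32 master weights
x = rng.standard_normal((32, 4)).astype(np.float16)         # fp16 activations
y = x @ np.ones(4, dtype=np.float16)                        # targets: true weights are all ones
loss_scale, lr = 128.0, 0.1                                 # static scale, illustrative values

for _ in range(200):
    w16 = w_master.astype(np.float16)                       # fp16 copy used for compute
    err = x @ w16 - y
    grad16 = (2.0 / len(x)) * (x.T @ (err * loss_scale))    # gradient of the scaled loss, in fp16
    grad32 = grad16.astype(np.float32) / loss_scale         # unscale in fp32
    if np.all(np.isfinite(grad32)):                         # skip the update if fp16 overflowed
        w_master -= lr * grad32
print(w_master.round(2))                                     # ~[1. 1. 1. 1.]
```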
GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding
TLDR
GShard enabled scaling a multilingual neural machine translation Transformer model with Sparsely-Gated Mixture-of-Experts beyond 600 billion parameters using automatic sharding, and it is demonstrated that such a giant model can be trained efficiently on 2048 TPU v3 accelerators in 4 days to achieve far superior quality for translation from 100 languages to English compared to the prior art.
Adaptively Sparse Transformers
TLDR
This work introduces the adaptively sparse Transformer, wherein attention heads have flexible, context-dependent sparsity patterns, accomplished by replacing softmax with alpha-entmax: a differentiable generalization of softmax that allows low-scoring words to receive precisely zero weight.
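For intuition, the sketch below implements sparsemax, the alpha = 2 special case of the entmax family; the paper's adaptive, per-head learned alpha is not reproduced here.

```python
# Sparsemax (Martins & Astudillo, 2016), the alpha = 2 member of the entmax family:
# like alpha-entmax it can assign exactly zero weight to low-scoring items,
# unlike softmax, which always gives every item some probability mass.
import numpy as np

def sparsemax(z: np.ndarray) -> np.ndarray:
    """Euclidean projection of a score vector onto the probability simplex."""
    z_sorted = np.sort(z)[::-1]
    k = np.arange(1, len(z) + 1)
    cumsum = np.cumsum(z_sorted)
    support = 1 + k * z_sorted > cumsum          # which entries stay non-zero
    k_max = k[support].max()
    tau = (cumsum[k_max - 1] - 1) / k_max        # threshold
    return np.maximum(z - tau, 0.0)

scores = np.array([2.0, 1.2, 0.1, -1.5])
print(sparsemax(scores))   # [0.9 0.1 0.  0. ]: exact zeros for the weak scores
```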