Corpus ID: 231573431

Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity

William Fedus, Barret Zoph, Noam M. Shazeer
In deep learning, models typically reuse the same parameters for all inputs. Mixture of Experts (MoE) defies this and instead selects different parameters for each incoming example. The result is a sparsely-activated model -- with outrageous numbers of parameters -- but a constant computational cost. However, despite several notable successes of MoE, widespread adoption has been hindered by complexity, communication costs and training instability -- we address these with the Switch Transformer… 
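The core mechanism is top-1 routing: each token is sent only to the expert whose router score is highest, and that expert's output is scaled by the gate probability so the router still receives gradients. A minimal sketch in plain Python (the `switch_route` helper, the toy router weights, and the expert callables are illustrative, not the paper's implementation):

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def switch_route(token, router_weights, experts):
    """Top-1 (Switch-style) routing sketch: compute a router score per
    expert, dispatch the token to the argmax expert only, and scale the
    expert's output by the gate probability."""
    logits = [sum(w * x for w, x in zip(row, token)) for row in router_weights]
    probs = softmax(logits)
    k = max(range(len(probs)), key=lambda i: probs[i])  # chosen expert
    # Only experts[k] runs, so compute stays constant as experts are added.
    return [probs[k] * y for y in experts[k](token)]
```

Because exactly one expert runs per token, adding experts grows parameter count without growing per-token FLOPs, which is the capacity-versus-compute decoupling the abstract describes.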

FlexMoE: Scaling Large-scale Sparse Pre-trained Model Training via Dynamic Device Placement

This paper proposes a novel DNN training framework, FlexMoE, which systematically and transparently addresses the inefficiency caused by dynamic dataflow, and introduces a scheduling module over the existing DNN runtime to monitor the data flow, make scheduling plans, and dynamically adjust the model-to-hardware mapping guided by real-time data traffic.

Sparse MoE as the New Dropout: Scaling Dense and Self-Slimmable Transformers

This work proposes a new plug-and-play training framework, SMoE-Dropout, to enable scaling transformers to better accuracy in their full capacity without collapse and explores the overlooked scalability bottleneck of SMoEs and leveraging it to effectively scale dense transformers.

Lita: Accelerating Distributed Training of Sparsely Activated Models

A new communication scheduling scheme based on tensor partitioning is proposed that prioritizes all-to-all operations over other communication due to their blocking nature, and expert packing is introduced, which reduces the all-to-all transfer size and incorporates optimizations to mitigate its overheads.

FasterMoE: modeling and optimizing training of large-scale dynamic pre-trained models

This work designs a congestion-avoiding expert selection strategy that relieves network congestion for the lower latency of iterations, when modification of expert selection is allowed, and implements and integrates the above optimizations as a general system, FasterMoE, empowering efficient distributed MoE model training.

Alternating Updates for Efficient Transformers

This work introduces Alternating Updates (AltUp), a simple-to-implement method to increase a model's capacity without the computational burden, and presents extensions of AltUp to the sequence dimension, and demonstrates how it can be synergistically combined with existing approaches to obtain efficient models with even higher capacity.

A Practical Survey on Faster and Lighter Transformers

This survey investigates popular approaches to make Transformers faster and lighter and provides a comprehensive explanation of the methods’ strengths, limitations, and underlying assumptions to meet the desired trade-off between capacity, computation, and memory.

SpeechMoE: Scaling to Large Acoustic Models with Dynamic Routing Mixture of Experts

Experimental results show that SpeechMoE achieves a lower character error rate (CER) than traditional static networks at comparable computation cost, providing 7.0%-23.0% relative CER improvements on four evaluation datasets.

Sparse is Enough in Scaling Transformers

This work proposes Scaling Transformers, a family of next generation Transformer models that use sparse layers to scale efficiently and perform unbatched decoding much faster than the standard Transformer as the authors scale up the model size.

SPARTAN: Sparse Hierarchical Memory for Parameter-Efficient Transformers

SPARTAN is proposed, a parameter efficient (PE) and computationally fast architecture for edge devices that adds hierarchically organized sparse memory after each Transformer layer, thus significantly reducing storage costs by re-using the PLM backbone for different tasks.

M6-10T: A Sharing-Delinking Paradigm for Efficient Multi-Trillion Parameter Pretraining

Pseudo-to-Real is a simple training strategy for large models that require a high memory footprint; it is compatible with large models composed of sequential layers, and a practice of pretraining an unprecedented 10-trillion-parameter model on only 512 GPUs within 10 days is demonstrated.

Generating Long Sequences with Sparse Transformers

This paper introduces sparse factorizations of the attention matrix which reduce the quadratic cost to $O(n \sqrt{n})$, generates unconditional samples that demonstrate global coherence and great diversity, and shows that it is possible in principle to use self-attention to model sequences of length one million or more.
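One family of factorized patterns combines a local window with strided "summary" positions, so each query attends to roughly $O(\sqrt{n})$ keys instead of all $n$. A toy mask builder sketching that idea (the `strided_sparse_mask` helper and the stride choice are illustrative, not the paper's exact pattern):

```python
def strided_sparse_mask(n, stride):
    """Sketch of a strided sparse-attention pattern: position i may attend
    to (a) the previous `stride` positions (local window) and (b) every
    stride-th earlier "summary" position, all under a causal constraint.
    With stride ~ sqrt(n), each row has O(sqrt(n)) allowed keys."""
    mask = [[False] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):                       # causal: j <= i
            local = (i - j) < stride                 # recent window
            summary = (j % stride) == stride - 1     # strided columns
            mask[i][j] = local or summary
    return mask
```

Stacking one layer with the local pattern and one with the strided pattern lets information flow between any two positions in two hops, which is why the factorization preserves expressivity while cutting cost.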

ZeRO: Memory optimizations Toward Training Trillion Parameter Models

ZeRO eliminates memory redundancies in data- and model-parallel training while retaining low communication volume and high computational granularity, allowing us to scale the model size proportional to the number of devices with sustained high efficiency.
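The saving comes from giving each data-parallel rank a disjoint shard of the optimizer states (and, in later stages, gradients and parameters) instead of a full replica, so per-device state memory shrinks roughly in proportion to the number of devices. A toy round-robin partition sketching the idea (`partition_states` is an illustrative helper, not the DeepSpeed API):

```python
def partition_states(param_ids, n_devices):
    """ZeRO-style partitioning sketch: assign each parameter's optimizer
    state to exactly one rank, round-robin, so no rank holds a full copy.
    Ranks gather the shards they need during the update step."""
    shards = [[] for _ in range(n_devices)]
    for i, p in enumerate(param_ids):
        shards[i % n_devices].append(p)
    return shards
```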

Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer

This work introduces a Sparsely-Gated Mixture-of-Experts layer (MoE), consisting of up to thousands of feed-forward sub-networks, and applies the MoE to the tasks of language modeling and machine translation, where model capacity is critical for absorbing the vast quantities of knowledge available in the training corpora.
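The gating network described here keeps only the top-k experts per example, with noise added to the router logits to encourage load balancing across experts. A minimal sketch of noisy top-k gating (function name, the `noise_std` knob, and the renormalization details are illustrative assumptions, not the paper's exact formulation):

```python
import math
import random

def noisy_topk_gate(logits, k, noise_std=1.0, rng=random):
    """Noisy top-k gating sketch: perturb the router logits with Gaussian
    noise, keep the k largest, and renormalize them with a softmax.
    Returns a sparse dict {expert_index: gate_weight}."""
    noisy = [l + rng.gauss(0.0, noise_std) for l in logits]
    topk = sorted(range(len(noisy)), key=lambda i: noisy[i], reverse=True)[:k]
    m = max(noisy[i] for i in topk)
    exps = {i: math.exp(noisy[i] - m) for i in topk}
    z = sum(exps.values())
    return {i: e / z for i, e in exps.items()}
```

The final layer output is then the gate-weighted sum of the k selected experts' outputs, so only k of potentially thousands of experts do any work per example.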

Reformer: The Efficient Transformer

This work replaces dot-product attention by one that uses locality-sensitive hashing and uses reversible residual layers instead of the standard residuals, which allows storing activations only once in the training process instead of several times, making the model much more memory-efficient and much faster on long sequences.
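The hashing step can be pictured with random hyperplanes: vectors that point in similar directions tend to get the same sign pattern, so attention only needs to be computed within each bucket rather than over all pairs. A toy sketch of the bucketing idea (the `lsh_buckets` helper is illustrative; Reformer's actual scheme uses random rotations with multi-round hashing):

```python
import random

def lsh_buckets(vectors, n_planes, seed=0):
    """Locality-sensitive hashing sketch: hash each vector to the tuple of
    signs of its dot products with random hyperplanes. Nearby vectors tend
    to share a signature, hence a bucket."""
    rng = random.Random(seed)
    dim = len(vectors[0])
    planes = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(n_planes)]
    buckets = {}
    for idx, v in enumerate(vectors):
        sig = tuple(sum(p * x for p, x in zip(plane, v)) >= 0
                    for plane in planes)
        buckets.setdefault(sig, []).append(idx)
    return buckets
```

Since softmax attention is dominated by the keys closest to the query, restricting each query to its own bucket approximates full attention at far lower cost on long sequences.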

Scalable Transfer Learning with Expert Models

This work trains a diverse set of experts by exploiting existing label structures, and uses cheap-to-compute performance proxies to select the relevant expert for each target task, and provides an adapter-based architecture able to compress many experts into a single model.

Big Bird: Transformers for Longer Sequences

It is shown that BigBird is a universal approximator of sequence functions and is Turing complete, thereby preserving these properties of the quadratic, full attention model.

Exponentially Increasing the Capacity-to-Computation Ratio for Conditional Computation in Deep Learning

A novel parametrization of neural-network weight matrices is proposed that can increase the ratio of parameters to computation by up to an exponential factor; it works by turning on certain parameters (weight matrices) only when specific bit patterns of hidden-unit activations are obtained.
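One way to picture the idea: the sign pattern of d hidden units indexes into a bank of 2^d weight vectors, so parameters grow exponentially in d while each input touches only one vector. A speculative toy sketch of that mechanism (the `conditional_matmul` helper and the indexing scheme are my illustrative reading, not the paper's construction):

```python
def conditional_matmul(hidden, weight_bank):
    """Bit-pattern conditional computation sketch: build an integer index
    from the signs of the hidden activations, select one weight vector
    from a bank of 2^len(hidden) vectors, and apply only that one."""
    bits = 0
    for h in hidden:
        bits = (bits << 1) | (1 if h > 0 else 0)
    w = weight_bank[bits]          # only this vector is ever touched
    return sum(wi * hi for wi, hi in zip(w, hidden))
```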

GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding

GShard enabled us to scale up multilingual neural machine translation Transformer model with Sparsely-Gated Mixture-of-Experts beyond 600 billion parameters using automatic sharding and it is demonstrated that such a giant model can efficiently be trained on 2048 TPU v3 accelerators in 4 days to achieve far superior quality for translation from 100 languages to English compared to the prior art.

DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter

This work proposes a method to pre-train a smaller general-purpose language representation model, called DistilBERT, which can be fine-tuned with good performances on a wide range of tasks like its larger counterparts, and introduces a triple loss combining language modeling, distillation and cosine-distance losses.

Mixed Precision Training

This work introduces a technique to train deep neural networks using half-precision floating point numbers, and demonstrates that this approach works for a wide variety of models including convolutional neural networks, recurrent neural networks and generative adversarial networks.
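A key ingredient of the half-precision recipe is loss scaling: the loss is multiplied by a large factor before backpropagation so small gradients don't underflow in fp16, then gradients are unscaled before the update. A sketch of the dynamic variant (the `apply_loss_scaling` helper and the growth/backoff policy are illustrative; the paper itself uses a chosen static scale, and frameworks typically grow the scale only after many good steps):

```python
def apply_loss_scaling(grads, scale, growth=2.0, backoff=0.5):
    """Dynamic loss-scaling sketch: `grads` were computed on a loss that
    was multiplied by `scale`. If any gradient overflowed to inf/NaN,
    skip the step and shrink the scale; otherwise unscale and grow it."""
    overflow = any(g != g or g in (float("inf"), float("-inf")) for g in grads)
    if overflow:
        return None, scale * backoff      # skip this update, back off
    return [g / scale for g in grads], scale * growth
```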