MegaBlocks: Efficient Sparse Training with Mixture-of-Experts

  title={MegaBlocks: Efficient Sparse Training with Mixture-of-Experts},
  author={Trevor Gale and Deepak Narayanan and Cliff Young and Matei A. Zaharia},
We present MegaBlocks, a system for efficient Mixture-of-Experts (MoE) training on GPUs. Our system is motivated by the limitations of current frameworks, which restrict the dynamic routing in MoE layers to satisfy the constraints of existing software and hardware. These formulations force a tradeoff between model quality and hardware efficiency, as users must choose between dropping tokens from the computation or wasting computation and memory on padding. To address these limitations, we… 

Figures and Tables from this paper



Tutel: Adaptive Mixture-of-Experts at Scale

Tutel is presented, a highly scalable stack design and implementation for MoE with dynamically adaptive parallelism and pipelining with superior accuracy in both pre-training and down-stream computer vision tasks such as COCO object detection than the counterpart dense model, indicating the readiness of Tutel for end-to-end real-world model training and inference.

Triton: an intermediate language and compiler for tiled neural network computations

Triton is presented, a language and compiler centered around the concept of tile, i.e., statically shaped multi-dimensional sub-arrays for expressing tensor programs in terms of operations on parametric tile variables and a set of novel tile-level optimization passes for compiling these programs into efficient GPU code.

Fast Sparse ConvNets

This work introduces a family of efficient sparse kernels for several hardware platforms, and shows that sparse versions of MobileNet v1 and Mobile net v2 architectures substantially outperform strong dense baselines on the efficiency-accuracy curve.

Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity

This work simplifies the MoE routing algorithm and design intuitive improved models with reduced communication and computational costs, and advances the current scale of language models by pre-training up to trillion parameter models on the “Colossal Clean Crawled Corpus”, and achieves a 4x speedup over the T5-XXL model.

Scaling Vision with Sparse Mixture of Experts

This work presents a Vision MoE, a sparse version of the Vision Transformer that is scalable and competitive with the largest dense networks, and proposes an extension to the routing algorithm that can prioritize subsets of each input across the entire batch, leading to adaptive per-image compute.

FasterMoE: modeling and optimizing training of large-scale dynamic pre-trained models

This work designs a congestion-avoiding expert selection strategy that relieves network congestion for the lower latency of iterations, when modification of expert selection is allowed, and implements and integrates the above optimizations as a general system, FasterMoE, empowering efficient distributed MoE model training.

Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer

This work introduces a Sparsely-Gated Mixture-of-Experts layer (MoE), consisting of up to thousands of feed-forward sub-networks, and applies the MoE to the tasks of language modeling and machine translation, where model capacity is critical for absorbing the vast quantities of knowledge available in the training corpora.

GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding

GShard enabled us to scale up multilingual neural machine translation Transformer model with Sparsely-Gated Mixture-of-Experts beyond 600 billion parameters using automatic sharding and it is demonstrated that such a giant model can efficiently be trained on 2048 TPU v3 accelerators in 4 days to achieve far superior quality for translation from 100 languages to English compared to the prior art.

Balanced Sparsity for Efficient DNN Inference on GPU

This paper proposes a novel fine-grained sparsity approach, Balanced Sparsity, to achieve high model accuracy with commercial hardwares efficiently and adapts to high parallelism property of GPU, showing incredible potential for sparsity in the widely deployment of deep learning services.

Sparse GPU Kernels for Deep Learning

This work develops high-performance GPU kernels for two sparse matrix operations widely applicable in neural networks: sparse matrix–dense matrix multiplication and sampled dense– dense matrix multiplication.