MegaBlocks: Efficient Sparse Training with Mixture-of-Experts
@article{Gale2022MegaBlocksES, title={MegaBlocks: Efficient Sparse Training with Mixture-of-Experts}, author={Trevor Gale and Deepak Narayanan and Cliff Young and Matei A. Zaharia}, journal={ArXiv}, year={2022}, volume={abs/2211.15841} }
We present MegaBlocks, a system for efficient Mixture-of-Experts (MoE) training on GPUs. Our system is motivated by the limitations of current frameworks, which restrict the dynamic routing in MoE layers to satisfy the constraints of existing software and hardware. These formulations force a tradeoff between model quality and hardware efficiency, as users must choose between dropping tokens from the computation or wasting computation and memory on padding. To address these limitations, we…
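The tradeoff described above can be made concrete with a small sketch. The following is a minimal numpy illustration of our own (not MegaBlocks code), assuming top-1 routing and a fixed per-expert capacity: experts that receive more tokens than their capacity drop the excess, while experts that receive fewer waste the remaining slots as padding. The helper names and `capacity_factor` are hypothetical.

```python
# Minimal illustration (not MegaBlocks code) of token dropping vs. padding under a
# fixed per-expert capacity; names and constants here are illustrative only.
import numpy as np

rng = np.random.default_rng(0)
num_tokens, num_experts, capacity_factor = 16, 4, 1.0
capacity = int(capacity_factor * num_tokens / num_experts)   # fixed slots per expert

assignments = rng.integers(num_experts, size=num_tokens)     # top-1 expert per token
for e in range(num_experts):
    routed = np.flatnonzero(assignments == e)
    dropped = max(0, len(routed) - capacity)                  # tokens beyond capacity are dropped
    padding = max(0, capacity - len(routed))                  # unused slots are wasted as padding
    print(f"expert {e}: {len(routed)} routed, {dropped} dropped, {padding} padded")
```

Raising the capacity factor trades dropped tokens for more padding, which is exactly the tradeoff the reformulation in the paper is meant to remove.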
References
Tutel: Adaptive Mixture-of-Experts at Scale
- Computer Science, ArXiv
- 2022
This work presents Tutel, a highly scalable stack design and implementation for MoE with dynamically adaptive parallelism and pipelining, which achieves superior accuracy to its dense counterpart in both pre-training and downstream computer vision tasks such as COCO object detection, indicating Tutel's readiness for end-to-end real-world model training and inference.
Triton: an intermediate language and compiler for tiled neural network computations
- Computer Science, MAPL@PLDI
- 2019
This work presents Triton, a language and compiler centered around the concept of a tile, i.e., statically shaped multi-dimensional sub-arrays, for expressing tensor programs as operations on parametric tile variables, together with a set of novel tile-level optimization passes that compile these programs into efficient GPU code.
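As a hedged illustration of the tile concept, here is a minimal vector-addition kernel written against Triton's present-day Python front end rather than the Triton-C syntax of the 2019 paper; it assumes the `triton` package and a CUDA GPU. Each program instance owns one statically shaped tile of `BLOCK` elements.

```python
# Sketch of Triton's tile-based model: one program instance per BLOCK-element tile.
# Uses the modern Python front end; requires triton, torch, and a CUDA device.
import torch
import triton
import triton.language as tl

@triton.jit
def add_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK: tl.constexpr):
    pid = tl.program_id(axis=0)                   # which tile this program handles
    offsets = pid * BLOCK + tl.arange(0, BLOCK)   # indices covered by the tile
    mask = offsets < n_elements                   # guard the ragged final tile
    x = tl.load(x_ptr + offsets, mask=mask)
    y = tl.load(y_ptr + offsets, mask=mask)
    tl.store(out_ptr + offsets, x + y, mask=mask)

def add(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    out = torch.empty_like(x)
    n = x.numel()
    grid = (triton.cdiv(n, 1024),)                # one program per 1024-element tile
    add_kernel[grid](x, y, out, n, BLOCK=1024)
    return out

# Usage (illustrative): z = add(torch.randn(9999, device="cuda"),
#                               torch.randn(9999, device="cuda"))
```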
Fast Sparse ConvNets
- Computer Science, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
- 2020
This work introduces a family of efficient sparse kernels for several hardware platforms, and shows that sparse versions of the MobileNet v1 and MobileNet v2 architectures substantially outperform strong dense baselines on the efficiency-accuracy curve.
Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity
- Computer Science, ArXiv
- 2021
This work simplifies the MoE routing algorithm and designs intuitive, improved models with reduced communication and computational costs, advancing the current scale of language models by pre-training up to trillion-parameter models on the “Colossal Clean Crawled Corpus” and achieving a 4x speedup over the T5-XXL model.
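A hedged numpy sketch of the simplified top-1 ("switch") routing described above, not the authors' code: each token is sent to its single highest-probability expert and the expert output is scaled by that probability. `switch_route` and its arguments are illustrative names, and load balancing and capacity limits are omitted.

```python
# Illustrative top-1 ("switch") routing; not the paper's implementation.
import numpy as np

def switch_route(x, router_w, experts):
    """x: [tokens, d_model]; router_w: [d_model, num_experts];
    experts: callables mapping [n, d_model] -> [n, d_model]."""
    logits = x @ router_w
    probs = np.exp(logits - logits.max(axis=-1, keepdims=True))
    probs /= probs.sum(axis=-1, keepdims=True)        # softmax over experts
    expert_idx = probs.argmax(axis=-1)                # top-1 expert per token
    gate = probs[np.arange(len(x)), expert_idx]       # probability of the chosen expert
    y = np.empty_like(x)
    for e, expert in enumerate(experts):
        sel = expert_idx == e
        if sel.any():
            y[sel] = gate[sel][:, None] * expert(x[sel])  # scale output by the gate value
    return y

# Usage (illustrative, with identity "experts"):
# out = switch_route(np.random.randn(8, 16), np.random.randn(16, 4), [lambda h: h] * 4)
```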
Scaling Vision with Sparse Mixture of Experts
- Computer Science, NeurIPS
- 2021
This work presents a Vision MoE, a sparse version of the Vision Transformer that is scalable and competitive with the largest dense networks, and proposes an extension to the routing algorithm that can prioritize subsets of each input across the entire batch, leading to adaptive per-image compute.
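The batch-level prioritization mentioned above can be sketched as follows (our illustration, not the paper's implementation): tokens from the whole batch are processed in order of their routing score, so when an expert's capacity is exhausted it is the lowest-scoring tokens that are skipped. `batch_priority_assign` is a hypothetical helper name.

```python
# Illustrative batch-priority assignment under fixed expert capacity.
import numpy as np

def batch_priority_assign(scores, capacity):
    """scores: [tokens, experts] router probabilities; returns expert id per token, or -1 if skipped."""
    expert = scores.argmax(-1)                     # preferred expert per token
    priority = np.argsort(-scores.max(-1))         # highest-scoring tokens claim slots first
    slots = np.zeros(scores.shape[1], dtype=int)
    assigned = np.full(len(scores), -1)
    for t in priority:
        e = expert[t]
        if slots[e] < capacity:
            assigned[t] = e
            slots[e] += 1
    return assigned
```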
FasterMoE: modeling and optimizing training of large-scale dynamic pre-trained models
- Computer Science, PPoPP
- 2022
This work designs a congestion-avoiding expert selection strategy that relieves network congestion and lowers iteration latency when modification of expert selection is allowed, and implements and integrates these optimizations as a general system, FasterMoE, enabling efficient distributed MoE model training.
Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer
- Computer Science, ICLR
- 2017
This work introduces a Sparsely-Gated Mixture-of-Experts layer (MoE), consisting of up to thousands of feed-forward sub-networks, and applies the MoE to the tasks of language modeling and machine translation, where model capacity is critical for absorbing the vast quantities of knowledge available in the training corpora.
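A minimal numpy paraphrase of sparsely-gated top-k routing as summarized above, not the paper's implementation (which also adds gating noise and load-balancing losses): each token activates only `k` experts and combines their outputs with renormalized gate weights. `topk_moe` is an illustrative name.

```python
# Illustrative top-k mixture-of-experts combine; noise and auxiliary losses omitted.
import numpy as np

def topk_moe(x, gate_w, experts, k=2):
    """x: [tokens, d]; gate_w: [d, num_experts];
    experts: callables mapping [n, d] -> [n, d]."""
    logits = x @ gate_w
    topk = np.argsort(logits, axis=-1)[:, -k:]                 # k best experts per token
    topk_logits = np.take_along_axis(logits, topk, axis=-1)
    weights = np.exp(topk_logits - topk_logits.max(-1, keepdims=True))
    weights /= weights.sum(-1, keepdims=True)                  # softmax over the selected experts
    y = np.zeros_like(x)
    for e, expert in enumerate(experts):
        tok, slot = np.nonzero(topk == e)                      # tokens that picked expert e
        if len(tok):
            y[tok] += weights[tok, slot][:, None] * expert(x[tok])
    return y
```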
GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding
- Computer Science, ICLR
- 2021
GShard enables scaling a multilingual neural machine translation Transformer model with Sparsely-Gated Mixture-of-Experts beyond 600 billion parameters using automatic sharding, and demonstrates that such a giant model can be trained efficiently on 2048 TPU v3 accelerators in 4 days, achieving far superior translation quality from 100 languages to English compared to the prior art.
Balanced Sparsity for Efficient DNN Inference on GPU
- Computer Science, AAAI
- 2019
This paper proposes a novel fine-grained sparsity approach, Balanced Sparsity, that achieves high model accuracy efficiently on commercial hardware and adapts to the high-parallelism property of GPUs, showing strong potential for sparsity in the wide deployment of deep learning services.
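Our reading of the balanced-sparsity idea, as a hedged numpy sketch rather than the authors' code: each weight row is split into equal-sized blocks and the same number of entries is kept in every block, so GPU threads processing different blocks get uniform work. `balanced_prune` and its parameters are illustrative, and the column count is assumed divisible by the block size.

```python
# Illustrative balanced magnitude pruning: equal sparsity inside every block of a row.
import numpy as np

def balanced_prune(w, block_size=8, keep_per_block=2):
    rows, cols = w.shape                                  # cols assumed divisible by block_size
    blocks = w.reshape(rows, cols // block_size, block_size)
    keep = np.argsort(np.abs(blocks), axis=-1)[..., -keep_per_block:]  # largest entries per block
    mask = np.zeros_like(blocks, dtype=bool)
    np.put_along_axis(mask, keep, True, axis=-1)
    return (blocks * mask).reshape(rows, cols)

# Usage (illustrative): sparse_w = balanced_prune(np.random.randn(64, 64))
```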
Sparse GPU Kernels for Deep Learning
- Computer Science, SC20: International Conference for High Performance Computing, Networking, Storage and Analysis
- 2020
This work develops high-performance GPU kernels for two sparse matrix operations widely applicable in neural networks: sparse matrix–dense matrix multiplication and sampled dense–dense matrix multiplication.
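For reference, a small numpy/scipy sketch (our illustration, not the paper's CUDA kernels) of the two operations named above: SpMM multiplies a sparse matrix by a dense matrix, and SDDMM computes a dense-dense product only at the positions of a given sparsity pattern.

```python
# Illustrative SpMM and SDDMM on CPU with scipy; shapes and densities are arbitrary.
import numpy as np
import scipy.sparse as sp

rng = np.random.default_rng(0)
A = sp.random(64, 64, density=0.1, random_state=0, format="csr")   # sparse operand / pattern
B = rng.standard_normal((64, 32))
C = rng.standard_normal((64, 32))

spmm = A @ B                                              # sparse matrix x dense matrix -> dense
rows, cols = A.nonzero()
sddmm_vals = np.einsum("ij,ij->i", B[rows], C[cols])      # (B @ C.T) sampled at A's nonzeros
sddmm = sp.csr_matrix((sddmm_vals, (rows, cols)), shape=A.shape)
```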