SpeechMoE2: Mixture-of-Experts Model with Improved Routing

@article{You2021Speechmoe2MM,
  title={SpeechMoE2: Mixture-of-Experts Model with Improved Routing},
  author={Zhao You and Shulin Feng and Dan Su and Dong Yu},
  journal={ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
  year={2021},
  pages={7217-7221}
}
  • Zhao You, Shulin Feng, Dan Su, Dong Yu
  • Published 23 November 2021
  • Computer Science
  • ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
Mixture-of-experts based acoustic models with dynamic routing mechanisms have shown promising results for speech recognition. The design of the router architecture is important for achieving large model capacity and high computational efficiency. Our previous work, SpeechMoE, only uses a local grapheme embedding to help routers make routing decisions. To further improve speech recognition performance against varying domains and accents, we propose a new router architecture which integrates… 
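
For reference, the sketch below shows the kind of sparsely gated MoE layer these acoustic models are built from, with a router that sees both the frame representation and an auxiliary embedding. This is a minimal illustration only: the top-1 selection, the concatenation-based fusion, and all module names are assumptions, not the exact SpeechMoE2 architecture (whose router changes the truncated abstract does not fully spell out).

```python
# Illustrative sketch only -- assumes a top-1 router that fuses the frame
# representation with an auxiliary (e.g. shared-embedding) vector.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseMoELayer(nn.Module):
    def __init__(self, d_model: int, d_embed: int, num_experts: int, d_hidden: int):
        super().__init__()
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_hidden), nn.ReLU(),
                          nn.Linear(d_hidden, d_model))
            for _ in range(num_experts)
        )
        # The router sees both the layer input and an auxiliary embedding.
        self.router = nn.Linear(d_model + d_embed, num_experts)

    def forward(self, x, aux_embed):
        # x: (T, d_model) frame features; aux_embed: (T, d_embed)
        logits = self.router(torch.cat([x, aux_embed], dim=-1))
        probs = F.softmax(logits, dim=-1)            # routing probabilities
        top_p, top_idx = probs.max(dim=-1)           # top-1 expert per frame
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            mask = top_idx == e
            if mask.any():
                # Scale by the gate value so routing stays differentiable.
                out[mask] = expert(x[mask]) * top_p[mask].unsqueeze(-1)
        return out
```

In a top-1 setup like this, each frame activates only a single expert's parameters, which is how such models grow capacity without a matching growth in per-frame computation.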

Citations

Knowledge Distillation for Mixture of Experts Models in Speech Recognition

This work proposes a simple approach to distill MoE models into dense models while retaining the accuracy gain achieved by large sparse models, and demonstrates the model compression efficiency of the knowledge distillation (KD) technique through multi-lingual speech recognition experiments.
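
The distillation setup summarized above is the standard logit-matching one; a hedged sketch follows, assuming a temperature-scaled KL term against the MoE teacher blended with the usual cross-entropy loss (function and argument names are illustrative).

```python
import torch.nn.functional as F

def kd_loss(student_logits, teacher_logits, targets, T=2.0, alpha=0.5):
    """Blend a soft KL term against the MoE teacher with the usual CE term."""
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)                       # standard temperature rescaling
    hard = F.cross_entropy(student_logits, targets)
    return alpha * soft + (1.0 - alpha) * hard
```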

3M: Multi-loss, Multi-path and Multi-level Neural Networks for speech recognition

The 3M model, summarized as multi-loss, multi-path and multi-level ("3M"), can effectively increase model capacity without remarkably increasing the computation cost for ASR tasks.

Building a great multi-lingual teacher with sparsely-gated mixture of experts for speech recognition

This work investigates how multi-lingual Automatic Speech Recognition networks can be scaled up with a simple routing algorithm in order to achieve better accuracy.

ST-MoE: Designing Stable and Transferable Sparse Expert Models

This work concludes by scaling a sparse model to 269B parameters, with a computational cost comparable to a 32B dense encoder-decoder Transformer (Stable and Transferable Mixture-of-Experts, or ST-MoE-32B), and achieves state-of-the-art performance in transfer learning.
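
The stabilization technique ST-MoE is known for is an auxiliary router z-loss that keeps routing logits small; a LaTeX sketch of that loss, with B router inputs in a batch, N experts, and x_j^(i) the logit of expert j for input i:

```latex
% Router z-loss from ST-MoE, penalizing large routing logits for stability:
L_z(x) = \frac{1}{B} \sum_{i=1}^{B} \left( \log \sum_{j=1}^{N} e^{x_j^{(i)}} \right)^{2}
```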

Designing Effective Sparse Expert Models

  • Barret Zoph, Irwan Bello, W. Fedus
  • Computer Science
    2022 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW)
  • 2022
This work scales a sparse model to 269B parameters, with a computational cost comparable to a 32B dense encoder-decoder Transformer (Stable and Transferable Mixture-of-Experts or ST-MoE-32B), and for the first time achieves state-of-the-art performance in transfer learning.

A Review of Sparse Expert Models in Deep Learning

The concept of sparse expert models is reviewed, a basic description of the common algorithms is provided, the advances in the deep learning era are contextualized, and areas for future work are highlighted.

Sparsity-Constrained Optimal Transport

A new approach for OT with explicit cardinality constraints on the transportation plan is proposed, motivated by an application to sparse mixtures of experts, where OT can be used to match input tokens such as image patches with expert models such as neural networks.
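
A hedged sketch of the formulation this summary points at: a standard OT objective over a cost matrix, with the transportation plan additionally restricted to a bounded number of nonzero entries. The per-column placement of the cardinality constraint below is an assumption for illustration.

```latex
% Optimal transport with an explicit cardinality (sparsity) constraint on the
% plan T; the per-column placement of the constraint is an assumption here.
\min_{T \ge 0} \ \langle T, C \rangle
\quad \text{s.t.} \quad
T \mathbf{1}_m = a, \qquad T^{\top} \mathbf{1}_n = b, \qquad
\lVert T_{:,j} \rVert_0 \le k \ \ \text{for all } j .
```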

References

Showing 1-10 of 15 references

SpeechMoE: Scaling to Large Acoustic Models with Dynamic Routing Mixture of Experts

SpeechMoE, a MoE based model for speech recognition, can achieve a lower character error rate (CER) than traditional static networks with comparable computation cost; the new router architecture used in this work can simultaneously utilize information from a shared embedding network and the hierarchical representation of different MoE layers.

Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity

This work simplifies the MoE routing algorithm and designs intuitive improved models with reduced communication and computational costs, advances the current scale of language models by pre-training up-to-trillion-parameter models on the “Colossal Clean Crawled Corpus”, and achieves a 4x speedup over the T5-XXL model.
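
The simplified routing is top-1 ("switch") routing: each token is sent to a single expert, and an auxiliary loss keeps the experts evenly loaded. A minimal sketch, written for clarity rather than efficiency (the paper additionally scales this loss by a small coefficient and enforces a capacity factor per expert, both omitted here):

```python
import torch
import torch.nn.functional as F

def switch_route(x, router_weight, num_experts):
    """Top-1 routing: each token goes to its highest-probability expert."""
    logits = x @ router_weight                      # (tokens, num_experts)
    probs = F.softmax(logits, dim=-1)
    gate, expert_idx = probs.max(dim=-1)            # top-1 gate and expert id

    # Auxiliary load-balancing loss: fraction of tokens dispatched to each
    # expert times the mean router probability assigned to that expert.
    frac_tokens = F.one_hot(expert_idx, num_experts).float().mean(dim=0)
    frac_probs = probs.mean(dim=0)
    aux_loss = num_experts * torch.sum(frac_tokens * frac_probs)
    return gate, expert_idx, aux_loss
```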

A Streaming On-Device End-To-End Model Surpassing Server-Side Conventional Model Quality and Latency

A first-pass Recurrent Neural Network Transducer (RNN-T) model and a second-pass Listen, Attend and Spell (LAS) rescorer that surpass a conventional model in both quality and latency are developed, and RNN-T+LAS is found to offer a better WER and latency tradeoff than a conventional model.

Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer

This work introduces a Sparsely-Gated Mixture-of-Experts layer (MoE), consisting of up to thousands of feed-forward sub-networks, and applies the MoE to the tasks of language modeling and machine translation, where model capacity is critical for absorbing the vast quantities of knowledge available in the training corpora.
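
The layer's gate is the noisy top-k gating function; a sketch under the usual formulation, where tunable Gaussian noise is added to the routing logits and everything outside the top k is masked out before the softmax (weight names assumed):

```python
import torch
import torch.nn.functional as F

def noisy_top_k_gating(x, w_gate, w_noise, k):
    """Sparsely-gated MoE gating: softmax over only the top-k noisy logits."""
    clean = x @ w_gate                                    # (tokens, experts)
    noise_std = F.softplus(x @ w_noise)                   # learned noise scale
    noisy = clean + torch.randn_like(clean) * noise_std
    topk_vals, topk_idx = noisy.topk(k, dim=-1)
    masked = torch.full_like(noisy, float("-inf"))
    masked.scatter_(-1, topk_idx, topk_vals)              # keep only top-k
    gates = F.softmax(masked, dim=-1)                     # zero elsewhere
    return gates
```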

Dynamic Routing Networks

  • Shaofeng Cai, Yao Shu, Wei Wang
  • Computer Science
    2021 IEEE Winter Conference on Applications of Computer Vision (WACV)
  • 2021
This paper proposes Dynamic Routing Networks (DRNets), which support efficient instance-aware inference by routing the input instance only to the necessary transformation branches, selected from a candidate set of branches for each connection between transformation nodes.

Deep Mixture of Experts via Shallow Embedding

This work explores a mixture of experts (MoE) approach to deep dynamic routing, which activates certain experts in the network on a per-example basis, and shows that Deep-MoEs are able to achieve higher accuracy with lower computation than standard convolutional networks.

Runtime Neural Pruning

A Runtime Neural Pruning (RNP) framework which prunes the deep neural network dynamically at runtime, preserving the full ability of the original network and adaptively conducting pruning according to the input image and current feature maps.
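
For contrast with token-level expert routing, the idea here is input-dependent channel pruning at inference time. The sketch below only shows the gating pattern; the RNP paper itself learns the pruning policy with reinforcement learning, and every name here is illustrative.

```python
import torch
import torch.nn as nn

class RuntimeChannelGate(nn.Module):
    """Illustrative input-dependent channel gating (RNP itself learns the
    pruning decisions with reinforcement learning rather than a threshold)."""
    def __init__(self, channels: int):
        super().__init__()
        self.decider = nn.Linear(channels, channels)

    def forward(self, feat):
        # feat: (batch, channels, H, W); summarize each channel, then gate it.
        summary = feat.mean(dim=(2, 3))                   # (batch, channels)
        keep = (torch.sigmoid(self.decider(summary)) > 0.5).float()
        return feat * keep.unsqueeze(-1).unsqueeze(-1)    # zero pruned channels
```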

Multi-Scale Dense Networks for Resource Efficient Image Classification

Experiments demonstrate that the proposed framework substantially improves the existing state-of-the-art in both image classification with computational resource limits at test time and budgeted batch classification.

FastMoE: A Fast Mixture-of-Expert Training System

This paper presents FastMoE, a distributed MoE training system based on PyTorch with common accelerators that provides a hierarchical interface for both flexible model design and easy adaption to different applications, such as Transformer-XL and Megatron-LM.

Hard Mixtures of Experts for Large Scale Weakly Supervised Vision

This work shows that a simple hard mixture of experts model can be efficiently trained to good effect on large scale hashtag (multilabel) prediction tasks, and demonstrates that it is feasible (and in fact relatively painless) to train far larger models than could be practically trained with standard CNN architectures, and that the extra capacity can be well used on current datasets.