Efficient Sparsely Activated Transformers

@article{Latifi2022EfficientSA,
  title={Efficient Sparsely Activated Transformers},
  author={Salar Latifi and Saurav Muralidharan and Michael Garland},
  journal={ArXiv},
  year={2022},
  volume={abs/2208.14580}
}
Transformer-based neural networks have achieved state-of-the-art task performance in a number of machine learning domains including natural language processing and computer vision. To further improve their accuracy, recent work has explored the integration of dynamic behavior into these networks in the form of mixture-of-experts (MoE) layers. In this paper, we explore the introduction of MoE layers to optimize a different metric: inference latency. We introduce a novel system named PLANER that…
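
The abstract is truncated above. For background on the setting PLANER operates in, the sketch below contrasts a standard dense Transformer feed-forward block with a sparsely activated (MoE) variant in which a router sends each token to a single smaller expert, so only a fraction of the parameters is touched per token at inference time. The class names, sizes, and top-1 routing rule are illustrative assumptions, not the paper's actual design.

```python
import torch
import torch.nn as nn

class DenseFFN(nn.Module):
    """Standard Transformer feed-forward block: every token uses every parameter."""
    def __init__(self, d_model=512, d_ff=2048):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model)
        )

    def forward(self, x):
        return self.net(x)

class SparseFFN(nn.Module):
    """Sparsely activated variant: a router sends each token to one smaller
    expert, so only a fraction of the parameters is used per token."""
    def __init__(self, d_model=512, d_ff=2048, num_experts=4):
        super().__init__()
        self.router = nn.Linear(d_model, num_experts)
        self.experts = nn.ModuleList(
            [DenseFFN(d_model, d_ff // num_experts) for _ in range(num_experts)]
        )

    def forward(self, x):                         # x: (num_tokens, d_model)
        expert_idx = self.router(x).argmax(dim=-1)
        out = torch.zeros_like(x)
        for i, expert in enumerate(self.experts):
            routed = expert_idx == i
            if routed.any():
                out[routed] = expert(x[routed])
        return out

tokens = torch.randn(128, 512)
print(DenseFFN()(tokens).shape, SparseFFN()(tokens).shape)
```

At equal total parameter count, the sparse variant performs a smaller matrix multiply per token, which is the kind of latency lever the abstract refers to.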

References

Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer

This work introduces a Sparsely-Gated Mixture-of-Experts (MoE) layer, consisting of up to thousands of feed-forward sub-networks, and applies the MoE to the tasks of language modeling and machine translation, where model capacity is critical for absorbing the vast quantities of knowledge available in the training corpora.
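
A minimal sketch of the top-k gating rule this layer is built on, under the simplifying assumption that the paper's tunable noise term and load-balancing losses are omitted; the token count and expert count below are arbitrary.

```python
import torch
import torch.nn.functional as F

def top_k_gate(x, w_gate, k=2):
    """Keep the k largest routing logits per token, mask the rest to -inf,
    and renormalise with a softmax so each token activates only k experts."""
    logits = x @ w_gate                              # (num_tokens, num_experts)
    topk_vals, topk_idx = logits.topk(k, dim=-1)
    masked = torch.full_like(logits, float("-inf"))
    masked.scatter_(-1, topk_idx, topk_vals)
    return F.softmax(masked, dim=-1)                 # sparse mixture weights

x = torch.randn(4, 16)                # 4 tokens, model width 16 (arbitrary)
w_gate = torch.randn(16, 8)           # 8 experts (arbitrary)
print(top_k_gate(x, w_gate, k=2))     # each row has exactly 2 non-zero entries
```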

Transformers: State-of-the-Art Natural Language Processing

Transformers is an open-source library that consists of carefully engineered state-of-the-art Transformer architectures under a unified API and a curated collection of pretrained models made by and available for the community.
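
For context, a minimal usage example of the library's unified API; the checkpoint name is just one publicly available model, and running this downloads weights over the network.

```python
from transformers import pipeline

# The unified API: a task name plus an optional pretrained checkpoint.
classifier = pipeline(
    "sentiment-analysis",
    model="distilbert-base-uncased-finetuned-sst-2-english",
)
print(classifier("Sparsely activated transformers can reduce inference latency."))
# e.g. [{'label': 'POSITIVE', 'score': 0.99...}]
```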

Neural Architecture Search with Reinforcement Learning

This paper uses a recurrent network to generate the model descriptions of neural networks and trains this RNN with reinforcement learning to maximize the expected accuracy of the generated architectures on a validation set.
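
A minimal, self-contained sketch of the controller-plus-REINFORCE loop this describes, with a toy scoring function standing in for validation accuracy and an invented three-decision search space; the paper itself uses an RNN controller over a far larger space.

```python
import torch
import torch.nn as nn

# Toy search space: one width choice per position, standing in for the
# architecture decisions an RNN controller would emit step by step.
choices = [128, 256, 512, 1024]
num_decisions = 3

controller_logits = nn.Parameter(torch.zeros(num_decisions, len(choices)))
optimizer = torch.optim.Adam([controller_logits], lr=0.1)

def toy_reward(arch):
    # Stand-in for accuracy on a validation set: prefers mid-sized layers.
    return -sum(abs(width - 512) for width in arch) / 1024.0

for step in range(200):
    dist = torch.distributions.Categorical(logits=controller_logits)
    sample = dist.sample()                                # one sampled architecture
    reward = toy_reward([choices[i] for i in sample.tolist()])
    # REINFORCE: adjust log-probabilities of samples in proportion to reward.
    loss = -dist.log_prob(sample).sum() * reward
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

best = [choices[i] for i in controller_logits.argmax(dim=-1).tolist()]
print("controller's preferred architecture:", best)
```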

Attention is All you Need

This work proposes the Transformer, a simple network architecture based solely on attention mechanisms that dispenses with recurrence and convolutions entirely, and shows that it generalizes well to other tasks by applying it successfully to English constituency parsing with both large and limited training data.
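
For reference, a minimal sketch of the scaled dot-product attention at the heart of this architecture, Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V, with arbitrary small dimensions.

```python
import math
import torch

def scaled_dot_product_attention(q, k, v, mask=None):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    scores = q @ k.transpose(-2, -1) / math.sqrt(q.size(-1))
    if mask is not None:
        scores = scores.masked_fill(mask == 0, float("-inf"))
    return torch.softmax(scores, dim=-1) @ v

q = k = v = torch.randn(2, 5, 64)                     # (batch, sequence length, head dim)
print(scaled_dot_product_attention(q, k, v).shape)    # torch.Size([2, 5, 64])
```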

Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity

This work simplifies the MoE routing algorithm, designs intuitively improved models with reduced communication and computational costs, advances the current scale of language models by pre-training up to trillion-parameter models on the “Colossal Clean Crawled Corpus”, and achieves a 4x speedup over the T5-XXL model.
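
A minimal sketch of the routing simplification described here: each token goes to a single (top-1) expert, and the expert output is scaled by the router probability. Expert sizes and counts are arbitrary, and the paper's capacity factor and auxiliary load-balancing loss are omitted.

```python
import torch
import torch.nn as nn

class SwitchFFN(nn.Module):
    """Top-1 routing: each token is processed by exactly one expert, and the
    expert output is scaled by that expert's router probability."""
    def __init__(self, d_model=256, d_ff=1024, num_experts=8):
        super().__init__()
        self.router = nn.Linear(d_model, num_experts)
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
            for _ in range(num_experts)
        ])

    def forward(self, x):                               # x: (num_tokens, d_model)
        probs = torch.softmax(self.router(x), dim=-1)
        top_prob, top_idx = probs.max(dim=-1)
        out = torch.zeros_like(x)
        for i, expert in enumerate(self.experts):
            routed = top_idx == i
            if routed.any():
                out[routed] = top_prob[routed].unsqueeze(-1) * expert(x[routed])
        return out

print(SwitchFFN()(torch.randn(32, 256)).shape)          # torch.Size([32, 256])
```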

Improving Transformer Models by Reordering their Sublayers

This work proposes a new ordering of transformer sublayers, the sandwich transformer, and shows that it improves perplexity on multiple word-level and character-level language modeling benchmarks at no cost in parameters, memory, or training time.
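
A minimal sketch of the reordering idea, assuming the usual shorthand of s for a self-attention sublayer and f for a feed-forward sublayer: the sandwich pattern front-loads attention sublayers and back-loads feed-forward ones while keeping their total counts unchanged. The helper only builds ordering strings; k is the sandwich coefficient.

```python
def interleaved(n):
    """Standard ordering: n (self-attention, feed-forward) pairs."""
    return "sf" * n

def sandwich(n, k):
    """Sandwich ordering with coefficient k: k attention sublayers up front,
    k feed-forward sublayers at the end, interleaved pairs in the middle."""
    assert 0 <= k <= n
    return "s" * k + "sf" * (n - k) + "f" * k

print(interleaved(6))    # sfsfsfsfsfsf
print(sandwich(6, 3))    # ssssfsfsffff  (same sublayer counts, different order)
```

Because only the ordering changes, the parameter count, memory footprint, and per-step compute stay the same, which is why the improvement comes at no extra cost.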

HAT: Hardware-Aware Transformers for Efficient Natural Language Processing

This work designs Hardware-Aware Transformers with neural architecture search: it trains a SuperTransformer that covers all candidates in the design space, efficiently produces many SubTransformers via weight sharing, and performs an evolutionary search under a hardware latency constraint.
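
A minimal sketch of a latency-constrained evolutionary search step of the kind described here; the configuration space, latency predictor, and fitness score below are invented stand-ins (in HAT these come from the trained SuperTransformer and a hardware latency model).

```python
import random

random.seed(0)

SEARCH_SPACE = {                     # invented; stands in for SubTransformer choices
    "layers":    [4, 6, 8, 12],
    "embed_dim": [256, 384, 512, 768],
    "ffn_dim":   [512, 1024, 2048, 3072],
}
LATENCY_BUDGET_MS = 60.0

def sample_config():
    return {name: random.choice(opts) for name, opts in SEARCH_SPACE.items()}

def mutate(cfg):
    child = dict(cfg)
    name = random.choice(list(SEARCH_SPACE))
    child[name] = random.choice(SEARCH_SPACE[name])
    return child

def predicted_latency_ms(cfg):       # stand-in for a hardware latency predictor
    return cfg["layers"] * (cfg["embed_dim"] + cfg["ffn_dim"]) / 500

def fitness(cfg):                    # stand-in for SuperTransformer-estimated quality
    return cfg["layers"] * 0.5 + cfg["embed_dim"] / 256 + cfg["ffn_dim"] / 1024

population = [sample_config() for _ in range(32)]
for generation in range(20):
    # Keep only candidates that satisfy the hardware latency constraint.
    feasible = [c for c in population if predicted_latency_ms(c) <= LATENCY_BUDGET_MS]
    parents = sorted(feasible, key=fitness, reverse=True)[:8] or [sample_config()]
    population = parents + [mutate(random.choice(parents)) for _ in range(24)]

feasible = [c for c in population if predicted_latency_ms(c) <= LATENCY_BUDGET_MS]
best = max(feasible, key=fitness)
print(best, round(predicted_latency_ms(best), 1), "ms")
```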

Transformer-XL: Attentive Language Models beyond a Fixed-Length Context

This work proposes a novel neural architecture Transformer-XL that enables learning dependency beyond a fixed length without disrupting temporal coherence, which consists of a segment-level recurrence mechanism and a novel positional encoding scheme.
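
A minimal sketch of the segment-level recurrence idea: hidden states from the previous segment are cached, detached from the gradient graph, and prepended as extra attention context for the current segment. The single-head layer and sizes are illustrative, and the paper's relative positional encoding is omitted.

```python
import math
import torch
import torch.nn as nn

class RecurrentAttention(nn.Module):
    """Single-head attention that also attends over a cached memory of the
    previous segment's hidden states."""
    def __init__(self, d_model=64):
        super().__init__()
        self.q = nn.Linear(d_model, d_model)
        self.k = nn.Linear(d_model, d_model)
        self.v = nn.Linear(d_model, d_model)

    def forward(self, x, memory=None):              # x: (segment_len, d_model)
        context = x if memory is None else torch.cat([memory, x], dim=0)
        scores = self.q(x) @ self.k(context).T / math.sqrt(x.size(-1))
        return torch.softmax(scores, dim=-1) @ self.v(context)

layer = RecurrentAttention()
memory = None
for segment in torch.randn(3, 16, 64):              # 3 consecutive segments of 16 tokens
    out = layer(segment, memory)
    memory = segment.detach()                       # cached, no gradient through it
print(out.shape)                                    # torch.Size([16, 64])
```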

GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding

GShard enabled scaling a multilingual neural machine translation Transformer model with Sparsely-Gated Mixture-of-Experts beyond 600 billion parameters using automatic sharding, and demonstrates that such a giant model can be trained efficiently on 2048 TPU v3 accelerators in 4 days to achieve far superior quality for translation from 100 languages to English compared to the prior art.
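
A minimal single-device sketch of the dispatch/combine pattern behind this kind of conditional computation, written with einsum so the expert dimension is explicit; in GShard that dimension is what gets sharded across accelerators. The sizes, random weights, and one-hot top-1 routing are illustrative simplifications.

```python
import torch
import torch.nn.functional as F

num_tokens, d_model, num_experts, d_ff = 8, 16, 4, 32
x = torch.randn(num_tokens, d_model)

# Top-1 router expressed as dispatch/combine tensors (t = token, e = expert).
logits = x @ torch.randn(d_model, num_experts)
combine = torch.softmax(logits, dim=-1) * F.one_hot(logits.argmax(dim=-1), num_experts)
dispatch = (combine > 0).float()                     # (t, e) 0/1 routing mask

# Dispatch tokens into per-expert buffers and apply each expert's own weights.
expert_in = torch.einsum("te,td->etd", dispatch, x)  # (e, t, d); zero rows if not routed
w1 = torch.randn(num_experts, d_model, d_ff)
w2 = torch.randn(num_experts, d_ff, d_model)
expert_out = torch.relu(expert_in @ w1) @ w2         # batched per-expert feed-forward

# Combine expert outputs back into token order, weighted by the gate values.
y = torch.einsum("te,etd->td", combine, expert_out)
print(y.shape)                                       # torch.Size([8, 16])
```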