Corpus ID: 237563187

Primer: Searching for Efficient Transformers for Language Modeling

@article{So2021PrimerSF,
  title={Primer: Searching for Efficient Transformers for Language Modeling},
  author={David R. So and Wojciech Mańke and Hanxiao Liu and Zihang Dai and Noam M. Shazeer and Quoc V. Le},
  journal={ArXiv},
  year={2021},
  volume={abs/2109.08668}
}
Large Transformer models have been central to recent advances in natural language processing. The training and inference costs of these models, however, have grown rapidly and become prohibitively expensive. Here we aim to reduce the costs of Transformers by searching for a more efficient variant. Compared to previous approaches, our search is performed at a lower level, over the primitives that define a Transformer TensorFlow program. We identify an architecture, named Primer, that has a… 
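The abstract above is truncated; as a concrete illustration of the kind of low-level primitive the search operates over, the published Primer architecture is best known for replacing the ReLU in the Transformer feed-forward block with a squared ReLU (alongside depthwise convolutions added to the attention projections). The NumPy sketch below shows only the squared-ReLU feed-forward sub-layer; it is an illustration under those assumptions, not the paper's TensorFlow implementation, and all shapes and weights are placeholders.

```python
import numpy as np

def squared_relu(x):
    """Squared ReLU: relu(x) ** 2 -- the activation Primer substitutes
    for ReLU in the Transformer feed-forward block."""
    return np.square(np.maximum(x, 0.0))

def feed_forward(x, w_in, w_out):
    """Position-wise feed-forward sub-layer using squared ReLU.
    x: [seq_len, d_model], w_in: [d_model, d_ff], w_out: [d_ff, d_model]."""
    return squared_relu(x @ w_in) @ w_out

# Toy usage with random weights (shapes are for illustration only).
rng = np.random.default_rng(0)
x = rng.normal(size=(8, 16))          # 8 tokens, d_model = 16
w_in = rng.normal(size=(16, 64))      # d_ff = 64
w_out = rng.normal(size=(64, 16))
print(feed_forward(x, w_in, w_out).shape)  # (8, 16)
```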
A Fast Post-Training Pruning Framework for Transformers
TLDR
A fast post-training pruning framework for Transformers that prunes Transformers in less than 3 minutes on a single GPU, which is over two orders of magnitude faster than existing pruning approaches that retrain.
ST-MoE: Designing Stable and Transferable Sparse Expert Models
TLDR
This work concludes by scaling a sparse model to 269B parameters, with a computational cost comparable to a 32B dense encoder-decoder Transformer (Stable and Transferable Mixture-of-Experts, or ST-MoE-32B), and achieves state-of-the-art performance in transfer learning.
Efficient Training of Audio Transformers with Patchout
TLDR
This work proposes a novel method to optimize and regularize transformers on audio spectrograms, achieving a new state-of-the-art performance on AudioSet while being trainable on a single consumer-grade GPU.
Flamingo: a Visual Language Model for Few-Shot Learning
TLDR
It is demonstrated that a single Flamingo model can achieve a new state of the art for few-shot learning, simply by prompting the model with task-specific examples.
Transformer Quality in Linear Time
TLDR
This work revisits the design choices in Transformers and proposes a simple layer named the gated attention unit, which allows the use of weaker single-head attention with minimal quality loss, together with a complementary linear approximation method that is accelerator-friendly and highly competitive in quality.
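As a rough sketch of what a gated attention unit of this kind looks like, the snippet below pairs a single softmax attention head with element-wise gating; the exact projections, activation functions, and attention normalization in the paper differ, so treat every choice here (SiLU gates, softmax scores, separate small query/key projections) as an assumption.

```python
import numpy as np

def silu(x):
    return x / (1.0 + np.exp(-x))

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def gated_attention_unit(x, w_u, w_v, w_q, w_k, w_o):
    """Simplified gated attention unit: a single attention head whose output
    is modulated element-wise by a learned gate, then projected back.
    x: [n, d]; w_u, w_v: [d, e]; w_q, w_k: [d, s]; w_o: [e, d]."""
    u = silu(x @ w_u)                            # gate values, [n, e]
    v = silu(x @ w_v)                            # attention values, [n, e]
    q, k = x @ w_q, x @ w_k                      # small query/key projections, [n, s]
    a = softmax(q @ k.T / np.sqrt(q.shape[-1]))  # single-head attention map, [n, n]
    return (u * (a @ v)) @ w_o                   # gate the attended values, project to d
```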
NormFormer: Improved Transformer Pretraining with Extra Normalization
TLDR
The proposed NormFormer architecture, which adds three normalization operations to each layer (a LayerNorm after self-attention, head-wise scaling of self-attention outputs, and a LayerNorm after the first fully connected layer), improves pretraining perplexity and downstream task performance for both causal and masked language models.
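A minimal sketch of those three extra operations in isolation is given below; where exactly they sit inside a Pre-LN residual block, and the learnable gains and biases of the LayerNorms, are omitted simplifications.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """LayerNorm over the last axis, without learnable gain/bias."""
    mu = x.mean(-1, keepdims=True)
    var = x.var(-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def normformer_extras(attn_heads, head_scales, ffn_hidden):
    """The three operations listed in the NormFormer summary, in isolation.
    attn_heads: [n, h, d_head] per-head self-attention outputs.
    head_scales: [h] learned per-head scale factors.
    ffn_hidden: [n, d_ff] output of the first fully connected layer (post-activation)."""
    # 1) head-wise scaling of self-attention outputs
    scaled = attn_heads * head_scales[None, :, None]
    # 2) LayerNorm applied to the (concatenated) self-attention output
    attn_out = layer_norm(scaled.reshape(scaled.shape[0], -1))
    # 3) LayerNorm after the first fully connected layer of the FFN
    ffn_out = layer_norm(ffn_hidden)
    return attn_out, ffn_out
```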
LiteTransformerSearch: Training-free On-device Search for Efficient Autoregressive Language Models
TLDR
This work rigorously shows that the latency-perplexity Pareto frontier can be found without the need for any model training, using the non-embedding parameter count as a proxy for perplexity, which organically induces a simple search algorithm that can be run directly on target devices.
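To make the proxy concrete, the toy sketch below ranks candidate decoder configurations by a rough non-embedding parameter count; the per-layer formula, the candidate grid, and the omission of the measured on-device latency term are my simplifications, not the paper's search procedure.

```python
def non_embedding_params(d_model, n_layers, d_ff):
    """Rough decoder parameter count excluding the embedding table, used as a
    training-free proxy for perplexity (more non-embedding parameters ~ lower perplexity).
    Per layer: attention projections (4 * d_model^2) + FFN (2 * d_model * d_ff)."""
    return n_layers * (4 * d_model * d_model + 2 * d_model * d_ff)

# Rank candidates by the proxy; the actual method trades this off against
# latency measured directly on the target device to trace a Pareto frontier.
candidates = [
    {"d_model": 256, "n_layers": 6,  "d_ff": 1024},
    {"d_model": 512, "n_layers": 12, "d_ff": 2048},
    {"d_model": 384, "n_layers": 8,  "d_ff": 1536},
]
ranked = sorted(candidates, key=lambda c: non_embedding_params(**c), reverse=True)
print(ranked[0])  # the configuration expected to reach the lowest perplexity
```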
αNAS: Neural Architecture Search using Property Guided Synthesis
TLDR
This work develops techniques that enable efficient NAS in a significantly larger design space and proposes an efficient synthesis procedure that accepts a set of promising program properties and returns a satisfying neural architecture.
NLP From Scratch Without Large-Scale Pretraining: A Simple and Efficient Framework
TLDR
This work proposes a simple and efficient learning framework TLM that does not rely on large-scale pretraining and achieves results better than or similar to pretrained language models while reducing the training FLOPs by two orders of magnitude.

References

SHOWING 1-10 OF 64 REFERENCES
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
TLDR
A new language representation model, BERT, designed to pre-train deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers, which can be fine-tuned with just one additional output layer to create state-of-the-art models for a wide range of tasks.
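For readers unfamiliar with how the bidirectional conditioning is trained, BERT's masked-language-model objective corrupts roughly 15% of the input tokens (80% replaced by [MASK], 10% by a random token, 10% left unchanged) and predicts the originals. The sketch below implements only that corruption step; the token list and vocabulary are placeholders.

```python
import random

def mask_tokens(tokens, vocab, mask_token="[MASK]", mask_prob=0.15, seed=0):
    """BERT-style masked-language-model corruption: select ~15% of positions;
    of those, 80% become [MASK], 10% a random token, 10% are left unchanged.
    Returns the corrupted sequence and the prediction targets."""
    rng = random.Random(seed)
    corrupted, targets = list(tokens), {}
    for i, tok in enumerate(tokens):
        if rng.random() < mask_prob:
            targets[i] = tok
            roll = rng.random()
            if roll < 0.8:
                corrupted[i] = mask_token
            elif roll < 0.9:
                corrupted[i] = rng.choice(vocab)
            # else: keep the original token unchanged
    return corrupted, targets

print(mask_tokens("the model learns deep bidirectional representations".split(),
                  vocab=["the", "a", "model", "text"], seed=3))
```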
The Evolved Transformer
TLDR
The Progressive Dynamic Hurdles method is developed, which dynamically allocates more resources to more promising candidate models on the computationally expensive WMT 2014 English-German translation task; the resulting Evolved Transformer demonstrates consistent improvement over the Transformer on four well-established language tasks.
Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity
TLDR
The proposed training techniques mitigate the instabilities, and it is shown that large sparse models may be trained, for the first time, in lower-precision (bfloat16) formats while achieving a 4x speedup over the T5-XXL model.
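The core of Switch routing is top-1 expert selection: each token goes to the single expert with the highest router probability, and that expert's output is scaled by the probability. The sketch below shows only this routing step; capacity factors, the load-balancing auxiliary loss, and the precision tricks mentioned above are omitted, and the expert interface is a placeholder.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def switch_route(x, w_router, experts):
    """Top-1 ("switch") routing: send each token to its highest-probability
    expert and scale the expert's output by that probability.
    x: [n, d]; w_router: [d, n_experts]; experts: list of callables [m, d] -> [m, d]."""
    probs = softmax(x @ w_router)      # router probabilities, [n, n_experts]
    choice = probs.argmax(axis=-1)     # chosen expert per token, [n]
    out = np.zeros_like(x)
    for e, expert in enumerate(experts):
        mask = choice == e
        if mask.any():
            out[mask] = probs[mask, e:e + 1] * expert(x[mask])
    return out
```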
Attention is All you Need
TLDR
A new simple network architecture, the Transformer, based solely on attention mechanisms and dispensing with recurrence and convolutions entirely, is proposed; it generalizes well to other tasks, as demonstrated by applying it successfully to English constituency parsing with both large and limited training data.
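The building block the paper is named for is scaled dot-product attention, Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V; a minimal single-head NumPy version follows (masking and the multi-head split are left out).

```python
import numpy as np

def scaled_dot_product_attention(q, k, v):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V.
    q: [n_q, d_k], k: [n_k, d_k], v: [n_k, d_v]."""
    scores = q @ k.T / np.sqrt(k.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v
```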
Language Modeling with Gated Convolutional Networks
TLDR
A finite-context approach through stacked convolutions, which can be more efficient since they allow parallelization over sequential tokens, is developed; this is the first time a non-recurrent approach is competitive with strong recurrent models on these large-scale language tasks.
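The gating mechanism this line of work introduced, the gated linear unit h(X) = (X*W) ⊙ σ(X*V) computed over causal convolutions, is sketched below; biases, residual connections, and the exact kernel widths are omitted assumptions.

```python
import numpy as np

def causal_conv1d(x, w):
    """Left-padded 1D convolution over the time axis so position t only sees
    tokens <= t.  x: [n, d_in], w: [k, d_in, d_out]."""
    k = w.shape[0]
    x_pad = np.concatenate([np.zeros((k - 1, x.shape[1])), x], axis=0)
    return np.stack([
        sum(x_pad[t + j] @ w[j] for j in range(k)) for t in range(x.shape[0])
    ])

def gated_linear_unit(x, w, v):
    """GLU gating from gated convolutional language models:
    h(X) = (X * W) ⊙ sigmoid(X * V), with two parallel causal convolutions."""
    a = causal_conv1d(x, w)
    b = causal_conv1d(x, v)
    return a * (1.0 / (1.0 + np.exp(-b)))
```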
Language Models are Unsupervised Multitask Learners
TLDR
It is demonstrated that language models begin to learn these tasks without any explicit supervision when trained on a new dataset of millions of webpages called WebText, suggesting a promising path towards building language processing systems which learn to perform tasks from their naturally occurring demonstrations.
One billion word benchmark for measuring progress in statistical language modeling
TLDR
A new benchmark corpus to be used for measuring progress in statistical language modeling, with almost one billion words of training data, is proposed, which is useful to quickly evaluate novel language modeling techniques, and to compare their contribution when combined with other advanced techniques.
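The metric reported on this benchmark is per-token perplexity, the exponential of the average negative log-likelihood; a short reference implementation:

```python
import math

def perplexity(token_log_probs):
    """Corpus-level perplexity: exp of the average per-token negative
    log-likelihood, the standard metric reported on this benchmark."""
    return math.exp(-sum(token_log_probs) / len(token_log_probs))

print(perplexity([math.log(0.1)] * 5))  # uniform 10% per token -> perplexity = 10.0
```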
XLNet: Generalized Autoregressive Pretraining for Language Understanding
TLDR
XLNet is proposed, a generalized autoregressive pretraining method that enables learning bidirectional contexts by maximizing the expected likelihood over all permutations of the factorization order, and overcomes the limitations of BERT thanks to its autoregressive formulation.
Sequence to Sequence Learning with Neural Networks
TLDR
This paper presents a general end-to-end approach to sequence learning that makes minimal assumptions on the sequence structure, and finds that reversing the order of the words in all source sentences improved the LSTM's performance markedly, because doing so introduced many short-term dependencies between the source and the target sentence that made the optimization problem easier.
TensorFlow: A system for large-scale machine learning
TLDR
The TensorFlow dataflow model is described, and the compelling performance that TensorFlow achieves for several real-world applications is demonstrated.