Corpus ID: 211126570

Transformer on a Diet

Chenguang Wang, Zihao Ye, Aston Zhang, Zheng Zhang, Alex Smola
Transformer has been widely used thanks to its ability to capture sequence information efficiently. However, recent developments, such as BERT and GPT-2, deliver only heavy architectures with a focus on effectiveness. In this paper, we explore three carefully designed light Transformer architectures to figure out whether a Transformer with fewer computations can produce competitive results. Experimental results on language model benchmark datasets hint that such a trade-off is…
A Practical Survey on Faster and Lighter Transformers
This survey investigates popular approaches to make the Transformer faster and lighter and provides a comprehensive explanation of the methods' strengths, limitations, and underlying assumptions to meet the desired trade-off between capacity, computation, and memory.
Adaptive Multi-Resolution Attention with Linear Complexity
Adaptive Multi-Resolution Attention (AdaMRA for short) is a novel and efficient structure that scales linearly with sequence length in terms of time and space and leverages a multi-resolution multi-head attention mechanism, enabling attention heads to capture long-range contextual information in a coarse-to-fine fashion.
SqueezeBERT: What can computer vision teach NLP about efficient neural networks?
It is demonstrated how to replace several operations in self-attention layers with grouped convolutions, and this technique is used in a novel network architecture called SqueezeBERT, which runs 4.3x faster than BERT-base on the Pixel 3 while achieving competitive accuracy on the GLUE test set.
TX-Ray: Quantifying and Explaining Model-Knowledge Transfer in (Un-)Supervised NLP
TX-Ray expresses neurons as feature preference distributions to quantify fine-grained knowledge transfer or adaptation and to guide human analysis. It is found that, similar to Lottery Ticket based pruning, TX-Ray based pruning can improve test set generalization, and that it can reveal how early stages of self-supervision automatically learn linguistic abstractions like parts of speech.


Language Models with Transformers
This paper explores effective Transformer architectures for language modeling, including adding additional LSTM layers to better capture the sequential context while keeping the computation efficient, and proposes Coordinate Architecture Search (CAS) to find an effective architecture through iterative refinement of the model.
Star-Transformer
Star-Transformer replaces the fully-connected structure with a star-shaped topology, in which every two non-adjacent nodes are connected through a shared relay node, and complexity is reduced from quadratic to linear, while preserving the capacity to capture both local composition and long-range dependency.
BP-Transformer: Modelling Long-Range Context via Binary Partitioning
By adopting a fine-to-coarse attention mechanism on multi-scale spans via binary partitioning (BP), BP-Transformer (BPT for short) is proposed, which achieves superior performance on long text compared with previous self-attention models.
ALBERT: A Lite BERT for Self-supervised Learning of Language Representations
This work presents two parameter-reduction techniques to lower memory consumption and increase the training speed of BERT, and uses a self-supervised loss that focuses on modeling inter-sentence coherence.
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
A new language representation model, BERT, designed to pre-train deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers, which can be fine-tuned with just one additional output layer to create state-of-the-art models for a wide range of tasks.
Attention is All you Need
A new simple network architecture, the Transformer, based solely on attention mechanisms and dispensing with recurrence and convolutions entirely, is proposed; it generalizes well to other tasks, as shown by applying it successfully to English constituency parsing with both large and limited training data.
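As a rough illustration (not code from any of the papers listed here), the scaled dot-product attention at the core of the Transformer can be sketched in a few lines of NumPy; the function name and shapes below are assumptions for the sketch:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V.

    Q, K: arrays of shape (seq_len, d_k); V: (seq_len, d_v).
    """
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                 # (seq_len, seq_len) similarities
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ V                              # weighted sum of value vectors

# Tiny usage example with random inputs.
rng = np.random.default_rng(0)
Q = rng.standard_normal((4, 8))
K = rng.standard_normal((4, 8))
V = rng.standard_normal((4, 8))
out = scaled_dot_product_attention(Q, K, V)
print(out.shape)  # (4, 8)
```

Note that the `scores` matrix is (seq_len, seq_len): this quadratic cost in sequence length is exactly what the lighter variants above (Star-Transformer, BPT, AdaMRA) aim to reduce to linear.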
On the State of the Art of Evaluation in Neural Language Models
This work reevaluates several popular architectures and regularisation methods with large-scale automatic black-box hyperparameter tuning and arrives at the somewhat surprising conclusion that standard LSTM architectures, when properly regularised, outperform more recent models.
RoBERTa: A Robustly Optimized BERT Pretraining Approach
It is found that BERT was significantly undertrained and, with improved pretraining, can match or exceed the performance of every model published after it; the best model achieves state-of-the-art results on GLUE, RACE, and SQuAD.
Breaking the Softmax Bottleneck: A High-Rank RNN Language Model
It is shown that the expressiveness of Softmax-based models (including the majority of neural language models) is limited by a Softmax bottleneck, and a simple and effective method is proposed to address this issue.
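The fix proposed in that paper is a Mixture of Softmaxes (MoS): mixing K softmax components lifts the rank restriction that a single softmax over a d-dimensional hidden state imposes on the log-probability matrix. A minimal NumPy sketch follows; all parameter names and shapes are assumptions for illustration, not the paper's implementation:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def mixture_of_softmaxes(h, W_ctx, W_out, W_prior):
    """Mixture of Softmaxes over a vocabulary.

    h:       (batch, d)     hidden states
    W_ctx:   (K, d, d)      per-component context projections (assumed shapes)
    W_out:   (d, vocab)     shared output embedding
    W_prior: (d, K)         mixture-weight projection
    """
    prior = softmax(h @ W_prior)                          # (batch, K) mixture weights
    comps = np.tanh(np.einsum('bd,kde->bke', h, W_ctx))   # (batch, K, d) component contexts
    probs = softmax(comps @ W_out)                        # (batch, K, vocab) per-component softmaxes
    return np.einsum('bk,bkv->bv', prior, probs)          # (batch, vocab) mixed distribution

# Usage example with random parameters (batch=2, d=4, K=3, vocab=10).
rng = np.random.default_rng(0)
h = rng.standard_normal((2, 4))
W_ctx = rng.standard_normal((3, 4, 4))
W_out = rng.standard_normal((4, 10))
W_prior = rng.standard_normal((4, 3))
p = mixture_of_softmaxes(h, W_ctx, W_out, W_prior)
print(p.shape)  # (2, 10)
```

Since each component distribution and the prior both sum to one, the mixed output is a valid probability distribution, but its log is no longer constrained to rank d.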
XLNet: Generalized Autoregressive Pretraining for Language Understanding
XLNet is proposed, a generalized autoregressive pretraining method that enables learning bidirectional contexts by maximizing the expected likelihood over all permutations of the factorization order and overcomes the limitations of BERT thanks to its autoregressive formulation.