Corpus ID: 218487423

Synthesizer: Rethinking Self-Attention in Transformer Models

@inproceedings{Tay2021SynthesizerRS,
  title={Synthesizer: Rethinking Self-Attention in Transformer Models},
  author={Yi Tay and Dara Bahri and Donald Metzler and Da-Cheng Juan and Zhe Zhao and Che Zheng},
  booktitle={ICML},
  year={2021}
}
The dot-product self-attention is known to be central and indispensable to state-of-the-art Transformer models. But is it really required? This paper investigates the true importance and contribution of the dot-product-based self-attention mechanism to the performance of Transformer models. Via extensive experiments, we find that (1) random alignment matrices surprisingly perform quite competitively and (2) learning attention weights from token-token (query-key) interactions is not that… 
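To make the contrast in the abstract concrete, here is a minimal NumPy sketch written for this summary, not taken from the paper's code: dot_product_attention is the standard query-key mechanism, while random_synthetic_attention uses an input-independent alignment matrix A, in the spirit of the "random alignment matrices" the abstract mentions. All function names, shapes, and the single-head setup are illustrative assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def dot_product_attention(X, Wq, Wk, Wv):
    """Standard self-attention: the alignment matrix comes from query-key dot products."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])   # (n, n) token-token interactions
    return softmax(scores) @ V

def random_synthetic_attention(X, A, Wv):
    """Toy "random alignment" attention: the (n, n) matrix A is fixed/learned directly
    and does not depend on the input tokens at all."""
    V = X @ Wv
    return softmax(A) @ V

# Toy usage: n tokens of dimension d.
rng = np.random.default_rng(0)
n, d = 8, 16
X = rng.normal(size=(n, d))
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
A = rng.normal(size=(n, n))  # input-independent alignment matrix
print(dot_product_attention(X, Wq, Wk, Wv).shape)   # (8, 16)
print(random_synthetic_attention(X, A, Wv).shape)   # (8, 16)
```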

Citations

Do We Really Need That Many Parameters In Transformer For Extractive Summarization? Discourse Can Help!
TLDR: This paper presents a novel parameter-lean self-attention mechanism using discourse priors that achieves competitive ROUGE scores on the task of extractive summarization and significantly outperforms the 8-head transformer model at the sentence level when a more balanced hyper-parameter setting is applied.
Human Interpretation and Exploitation of Self-attention Patterns in Transformers: A Case Study in Extractive Summarization
TLDR: This paper synergizes two lines of research in a human-in-the-loop pipeline to first find important task-specific attention patterns in the popular BERTSum model, and indicates that when such patterns are injected, both the original and the smaller model show improvements in performance and, arguably, interpretability.
CoCon: A Self-Supervised Approach for Controlled Text Generation
TLDR: This work proposes Content-Conditioner (CoCon) to control an LM's output text with a target content at a fine-grained level, and shows that CoCon can naturally incorporate target content into generated texts and control high-level text attributes in a zero-shot manner.
Multi-Head Attention: Collaborate Instead of Concatenate
TLDR: A collaborative multi-head attention layer that enables heads to learn shared projections, reduces the computational cost and number of parameters of an attention layer, and can be used as a drop-in replacement in any transformer architecture.
Cluster-Former: Clustering-based Sparse Transformer for Question Answering
TLDR: Cluster-Former is proposed, a novel clustering-based sparse Transformer that performs attention across chunked sequences, allowing information integration beyond local windows, which is especially beneficial for question answering (QA) tasks that rely on long-range dependencies.
Random Feature Attention
TLDR: RFA, a linear time and space attention that uses random feature methods to approximate the softmax function, is proposed and explored, showing that RFA is competitive in terms of both accuracy and efficiency on three long text classification datasets (a toy sketch of the random-feature idea follows this list).
PairConnect: A Compute-Efficient MLP Alternative to Attention
TLDR: This work revisits the memory-compute trade-off associated with the Transformer, particularly multi-head attention, and presents a memory-heavy but significantly more compute-efficient alternative to the Transformer.
Not all parameters are born equal: Attention is mostly what you need
TLDR: It is found that while the embedding layer is the least essential component for machine translation tasks, it is the most important one for language modelling tasks.
Fixed Encoder Self-Attention Patterns in Transformer-Based Machine Translation
TLDR: This paper proposes to replace all but one attention head of each encoder layer with simple, fixed (non-learnable) attentive patterns that are based solely on position and do not require any external knowledge.
Learning Hard Retrieval Decoder Attention for Transformers
TLDR: An approach to learning a hard retrieval attention where an attention head attends to only one token in the sentence rather than all tokens, which is 1.43 times faster in decoding and preserves translation quality on a wide range of machine translation tasks when used in the decoder self- and cross-attention networks.
...
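The Random Feature Attention entry above describes approximating the softmax with random feature maps. The sketch below is a toy NumPy illustration of that general idea (random Fourier features for the Gaussian kernel, with l2-normalized queries and keys), not the paper's implementation; the function names, feature count, and omission of the usual 1/sqrt(d) temperature are assumptions, and the estimate is noisy for small feature counts.

```python
import numpy as np

def random_fourier_features(x, W):
    """phi(x) such that phi(x) . phi(y) approximates exp(-||x - y||^2 / 2)."""
    proj = x @ W.T                                            # (n, D)
    return np.concatenate([np.sin(proj), np.cos(proj)], axis=-1) / np.sqrt(W.shape[0])

def rfa_style_attention(Q, K, V, num_features=256, seed=0):
    """Linear-time approximation of softmax attention via random features.
    With l2-normalized q and k, exp(q . k) is proportional to the Gaussian kernel,
    so attention weights are approximated by phi(q) . phi(k); the (n, n) score
    matrix is never formed. The estimate (and the denominator) can be poorly
    conditioned when num_features is small."""
    rng = np.random.default_rng(seed)
    W = rng.normal(size=(num_features, Q.shape[-1]))
    Qn = Q / np.linalg.norm(Q, axis=-1, keepdims=True)
    Kn = K / np.linalg.norm(K, axis=-1, keepdims=True)
    phi_q = random_fourier_features(Qn, W)                    # (n, 2D)
    phi_k = random_fourier_features(Kn, W)                    # (n, 2D)
    num = phi_q @ (phi_k.T @ V)                               # (n, d), linear in n
    den = phi_q @ phi_k.sum(axis=0, keepdims=True).T          # (n, 1)
    return num / den

# Toy usage.
rng = np.random.default_rng(1)
n, d = 128, 32
Q, K, V = (rng.normal(size=(n, d)) for _ in range(3))
print(rfa_style_attention(Q, K, V).shape)  # (128, 32)
```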

References

Showing 1-10 of 32 references
Linformer: Self-Attention with Linear Complexity
TLDR: This paper demonstrates that the self-attention mechanism of the Transformer can be approximated by a low-rank matrix, and proposes a new self-attention mechanism which reduces the overall self-attention complexity from $O(n^2)$ to $O(n)$ in both time and space (a toy sketch of this low-rank idea follows the reference list).
Longformer: The Long-Document Transformer
TLDR: Following prior work on long-sequence transformers, the Longformer is evaluated on character-level language modeling and achieves state-of-the-art results on text8 and enwik8; the authors also pretrain Longformer and finetune it on a variety of downstream tasks.
Fixed Encoder Self-Attention Patterns in Transformer-Based Machine Translation
TLDR: This paper proposes to replace all but one attention head of each encoder layer with simple, fixed (non-learnable) attentive patterns that are based solely on position and do not require any external knowledge.
Attention is All you Need
TLDR: A new simple network architecture, the Transformer, based solely on attention mechanisms and dispensing with recurrence and convolutions entirely, is proposed; it generalizes well to other tasks, as shown by applying it successfully to English constituency parsing with both large and limited training data.
Self-Attention with Relative Position Representations
TLDR: This work presents an alternative approach, extending the self-attention mechanism to efficiently consider representations of the relative positions, or distances, between sequence elements, evaluated on the WMT 2014 English-to-German and English-to-French translation tasks.
Pay Less Attention with Lightweight and Dynamic Convolutions
TLDR: It is shown that a very lightweight convolution can perform competitively with the best reported self-attention results, and dynamic convolutions are introduced, which are simpler and more efficient than self-attention.
Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer
TLDR: This systematic study compares pre-training objectives, architectures, unlabeled datasets, transfer approaches, and other factors on dozens of language understanding tasks and achieves state-of-the-art results on many benchmarks covering summarization, question answering, text classification, and more.
Music Transformer
TLDR: It is demonstrated that a Transformer with the modified relative attention mechanism can generate minute-long compositions with compelling structure, generate continuations that coherently elaborate on a given motif, and, in a seq2seq setup, generate accompaniments conditioned on melodies.
Effective Approaches to Attention-based Neural Machine Translation
TLDR: A global approach which always attends to all source words and a local one that only looks at a subset of source words at a time are examined, demonstrating the effectiveness of both approaches on the WMT translation tasks between English and German in both directions.
Language Modeling with Gated Convolutional Networks
TLDR: A finite-context approach through stacked convolutions, which can be more efficient since they allow parallelization over sequential tokens, is developed; this is the first time a non-recurrent approach is competitive with strong recurrent models on these large-scale language tasks.
...
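The Linformer reference above describes approximating self-attention with a low-rank matrix to reach linear complexity. The following toy NumPy sketch shows one way such a length-dimension projection can be wired up; it is an assumption-based illustration (random projection matrices E and F, a single head, no training), not the paper's actual code.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def low_rank_attention(Q, K, V, E, F):
    """Low-rank attention in the style described by the Linformer entry above:
    project the length-n keys/values down to k << n rows, so the score matrix
    is (n, k) instead of (n, n). E and F have shape (k, n) and are hypothetical
    learned projections, here drawn at random."""
    d = Q.shape[-1]
    K_proj, V_proj = E @ K, F @ V                  # (k, d) each
    scores = Q @ K_proj.T / np.sqrt(d)             # (n, k): linear in n
    return softmax(scores) @ V_proj                # (n, d)

# Toy usage.
rng = np.random.default_rng(0)
n, k, d = 512, 64, 32
Q, K, V = (rng.normal(size=(n, d)) for _ in range(3))
E, F = (rng.normal(size=(k, n)) / np.sqrt(n) for _ in range(2))
print(low_rank_attention(Q, K, V, E, F).shape)     # (512, 32)
```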