Corpus ID: 218487423

Synthesizer: Rethinking Self-Attention in Transformer Models

@article{Tay2020SynthesizerRS,
  title={Synthesizer: Rethinking Self-Attention in Transformer Models},
  author={Yi Tay and Dara Bahri and Donald Metzler and Da-Cheng Juan and Zhe Zhao and Che Zheng},
  journal={ArXiv},
  year={2020},
  volume={abs/2005.00743}
}
The dot product self-attention is known to be central and indispensable to state-of-the-art Transformer models. But is it really required? This paper investigates the true importance and contribution of the dot product-based self-attention mechanism to the performance of Transformer models. Via extensive experiments, we find that (1) random alignment matrices surprisingly perform quite competitively and (2) learning attention weights from token-token (query-key) interactions is useful but not that important after all. To this end, we propose Synthesizer, a model that learns synthetic attention weights without token-token interactions.
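
To make the idea in the abstract concrete, the following is a minimal NumPy sketch of a single Dense Synthesizer head (attention logits predicted from each token alone) and a Random Synthesizer head (a learned alignment matrix shared across inputs), assuming a fixed sequence length and omitting multi-head structure, masking, and training; the function and parameter names (dense_synthesizer, random_synthesizer, W1, W2, Wv, R) are illustrative, not the paper's reference implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def dense_synthesizer(X, W1, b1, W2, b2, Wv):
    """Dense Synthesizer head: each token predicts its own row of
    attention logits via a small MLP; no query-key dot products."""
    # X: (l, d) token representations; B: (l, l) synthetic logits
    B = np.maximum(X @ W1 + b1, 0.0) @ W2 + b2
    return softmax(B) @ (X @ Wv)          # mix value-projected tokens

def random_synthesizer(X, R, Wv):
    """Random Synthesizer head: the alignment matrix R is a (trainable
    or fixed) parameter, independent of the input tokens."""
    return softmax(R) @ (X @ Wv)

# Toy usage with illustrative sizes: sequence length l=4, width d=8.
rng = np.random.default_rng(0)
l, d = 4, 8
X = rng.normal(size=(l, d))
W1, b1 = rng.normal(size=(d, d)), np.zeros(d)
W2, b2 = rng.normal(size=(d, l)), np.zeros(l)
Wv = rng.normal(size=(d, d))
R = rng.normal(size=(l, l))

print(dense_synthesizer(X, W1, b1, W2, b2, Wv).shape)  # (4, 8)
print(random_synthesizer(X, R, Wv).shape)              # (4, 8)
```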

Citations

Do We Really Need That Many Parameters In Transformer For Extractive Summarization? Discourse Can Help !
Multi-Head Attention: Collaborate Instead of Concatenate
Random Feature Attention
PairConnect: A Compute-Efficient MLP Alternative to Attention
Not all parameters are born equal: Attention is mostly what you need
Selective Knowledge Distillation for Neural Machine Translation
Not All Attention Is All You Need (Hongqiu Wu, Hai Zhao, Min Zhang; ArXiv, 2021)
An Attention Free Transformer
Eigen Analysis of Self-Attention and its Reconstruction from Partial Computation

References

Attention is All you Need
Self-Attention with Relative Position Representations
Music Transformer
Effective Approaches to Attention-based Neural Machine Translation
Language Modeling with Gated Convolutional Networks
Universal Transformers
Gated Self-Matching Networks for Reading Comprehension and Question Answering