Corpus ID: 218487423

Synthesizer: Rethinking Self-Attention in Transformer Models

@article{Tay2021SynthesizerRS,
  title={Synthesizer: Rethinking Self-Attention in Transformer Models},
  author={Yi Tay and Dara Bahri and Donald Metzler and Da-Cheng Juan and Zhe Zhao and Che Zheng},
  journal={ArXiv},
  year={2021},
  volume={abs/2005.00743}
}
The dot product self-attention is known to be central and indispensable to state-of-the-art Transformer models. But is it really required? This paper investigates the true importance and contribution of the dot-product-based self-attention mechanism to the performance of Transformer models. Via extensive experiments, we find that (1) random alignment matrices surprisingly perform quite competitively and (2) learning attention weights from token-token (query-key) interactions is useful but not that important after all. To this end, we propose Synthesizer, a model that learns synthetic attention weights without token-token interactions.
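
To make the "random alignment matrices" finding concrete, the sketch below implements a single-head attention layer whose alignment matrix is a learned (or optionally fixed) parameter rather than a function of query-key dot products. This is a minimal PyTorch sketch, not the authors' released code; the class name RandomSynthesizerAttention, the d_model/max_len arguments, and the single-head, batch-shared alignment matrix are illustrative assumptions.

    # Minimal sketch (assumed names, not the paper's code): attention whose
    # alignment weights are a parameter, independent of token-token interactions.
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class RandomSynthesizerAttention(nn.Module):
        """Single-head attention with a random, input-independent alignment matrix."""

        def __init__(self, d_model: int, max_len: int, trainable: bool = True):
            super().__init__()
            # Random alignment scores; set trainable=False for a fixed random variant.
            self.attn_logits = nn.Parameter(torch.randn(max_len, max_len),
                                            requires_grad=trainable)
            self.value = nn.Linear(d_model, d_model)
            self.out = nn.Linear(d_model, d_model)

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            # x: (batch, seq_len, d_model)
            seq_len = x.size(1)
            # Alignment weights come from the parameter, not from Q K^T.
            weights = F.softmax(self.attn_logits[:seq_len, :seq_len], dim=-1)
            v = self.value(x)  # (batch, seq_len, d_model)
            # The same alignment matrix is shared across the whole batch.
            return self.out(torch.einsum("qk,bkd->bqd", weights, v))

    # Usage on a toy batch:
    layer = RandomSynthesizerAttention(d_model=64, max_len=128)
    y = layer(torch.randn(2, 10, 64))
    print(y.shape)  # torch.Size([2, 10, 64])

Sharing one max_len x max_len alignment matrix across all inputs is what removes token-token interactions from the attention weights; replacing self.attn_logits with scores computed as Q K^T / sqrt(d) would recover standard dot-product attention for comparison.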

Citations

Do We Really Need That Many Parameters In Transformer For Extractive Summarization? Discourse Can Help!
Multi-Head Attention: Collaborate Instead of Concatenate
Random Feature Attention
Not all parameters are born equal: Attention is mostly what you need
AutoBERT-Zero: Evolving BERT Backbone from Scratch (Jiahui Gao, Hang Xu, +5 authors, Zhenguo Li; ArXiv, 2021)
Selective Knowledge Distillation for Neural Machine Translation
Not All Attention Is All You Need
An Attention Free Transformer
