Corpus ID: 229376913

RealFormer: Transformer Likes Residual Attention

@article{He2020RealFormerTL,
  title={RealFormer: Transformer Likes Residual Attention},
  author={Ruining He and Anirudh Ravula and Bhargav Kanagal and Joshua Ainslie},
  journal={ArXiv},
  year={2020},
  volume={abs/2012.11747}
}
Abstract

Transformer is the backbone of modern NLP models. In this paper, we propose RealFormer, a simple Residual Attention Layer Transformer architecture that significantly outperforms canonical Transformers on a spectrum of tasks including Masked Language Modeling, GLUE, and SQuAD. Qualitatively, RealFormer is easy to implement and requires minimal hyper-parameter tuning. It also stabilizes training and leads to models with sparser attentions. Code will be open-sourced upon paper acceptance.
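
The abstract describes RealFormer as a "Residual Attention Layer" Transformer, i.e. a Transformer whose attention gets a residual-style skip connection across layers. Below is a minimal numpy sketch of one way such a skip over the raw (pre-softmax) attention scores can be wired up; the function names, shapes, and exact placement of the residual term are illustrative assumptions, not taken from the paper's released code.

import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def residual_attention(q, k, v, prev_scores=None):
    """Scaled dot-product attention with a skip connection on the raw
    (pre-softmax) attention scores carried over from the previous layer."""
    d = q.shape[-1]
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d)   # (batch, seq, seq)
    if prev_scores is not None:
        scores = scores + prev_scores                # residual over attention scores
    out = softmax(scores, axis=-1) @ v               # (batch, seq, d)
    return out, scores                               # pass raw scores to the next layer

# Toy usage: thread the raw scores through a stack of layers.
batch, seq, d = 2, 5, 8
q = k = v = np.random.randn(batch, seq, d)
prev = None
for _ in range(3):                                   # three toy "layers"
    out, prev = residual_attention(q, k, v, prev)

In a full model each layer (and each head) would compute its own q, k, v projections; the point of the sketch is only that the pre-softmax score tensor is threaded forward and added as a residual, which lets later layers reuse earlier attention patterns.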
