Why Self-Attention? A Targeted Evaluation of Neural Machine Translation Architectures

@article{Tang2018WhySA,
  title={Why Self-Attention? A Targeted Evaluation of Neural Machine Translation Architectures},
  author={Gongbo Tang and Mathias M{\"u}ller and Annette Rios Gonzales and Rico Sennrich},
  journal={ArXiv},
  year={2018},
  volume={abs/1808.08946}
}
  • Gongbo Tang, Mathias Müller, Annette Rios Gonzales, Rico Sennrich
  • Published 2018
  • Computer Science
  • ArXiv
  • Recently, non-recurrent architectures (convolutional, self-attentional) have outperformed RNNs in neural machine translation. [...] Our experimental results show that: 1) self-attentional networks and CNNs do not outperform RNNs in modeling subject-verb agreement over long distances; 2) self-attentional networks perform distinctly better than RNNs and CNNs on word sense disambiguation.
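
A common theoretical argument for self-attention over RNNs is that it connects any two positions directly, rather than through a chain of recurrent states; the paper's subject-verb agreement task probes whether this actually yields better long-distance modeling. The following is a minimal NumPy sketch of single-head scaled dot-product self-attention (the mechanism of "Attention is All you Need", referenced below) to illustrate that direct connection. It is only an illustration of the mechanism, not the NMT systems evaluated in the paper, and all names and dimensions are invented for the example.

    import numpy as np

    def self_attention(X, Wq, Wk, Wv):
        """Single-head scaled dot-product self-attention over X of shape
        (seq_len, d_model). Every output position attends to every input
        position in one step, regardless of distance."""
        Q, K, V = X @ Wq, X @ Wk, X @ Wv
        d_k = Q.shape[-1]
        scores = Q @ K.T / np.sqrt(d_k)                 # (seq_len, seq_len) pairwise scores
        weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
        weights /= weights.sum(axis=-1, keepdims=True)  # softmax over key positions
        return weights @ V                              # weighted sum of value vectors

    # Toy example: a 6-token "sentence". The subject (position 0) and the
    # verb (position 5) interact through a single attention weight, whereas
    # an RNN would relate them only through 5 intermediate recurrent steps.
    rng = np.random.default_rng(0)
    d_model = 8
    X = rng.normal(size=(6, d_model))
    Wq, Wk, Wv = (rng.normal(size=(d_model, d_model)) for _ in range(3))
    out = self_attention(X, Wq, Wk, Wv)
    print(out.shape)  # (6, 8)
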
    Citations

    Papers citing this work include:
    • Pay Less Attention with Lightweight and Dynamic Convolutions (172 citations)
    • Assessing the Ability of Self-Attention Networks to Learn Word Order (14 citations; highly influenced)
    • A Structural Probe for Finding Syntax in Word Representations (201 citations)
    • Are Sixteen Heads Really Better than One? (118 citations)
    • Adaptively Sparse Transformers (42 citations)
    • Augmenting Neural Machine Translation with Knowledge Graphs (7 citations; highly influenced)

    References

    Publications referenced by this paper.
    Showing 1-10 of 31 references:
    • Attention is All you Need (11,913 citations; highly influential)
    • Neural Machine Translation by Jointly Learning to Align and Translate (12,929 citations)
    • Sequence to Sequence Learning with Neural Networks (10,545 citations)
    • Neural Machine Translation of Rare Words with Subword Units (2,731 citations)
    • Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation (9,523 citations)
    • Long Short-Term Memory (31,049 citations; highly influential)
    • Finding Structure in Time (8,419 citations)
    • Assessing the Ability of LSTMs to Learn Syntax-Sensitive Dependencies (378 citations)
    • Adam: A Method for Stochastic Optimization (49,946 citations)
    • An Empirical Evaluation of Generic Convolutional and Recurrent Networks for Sequence Modeling (751 citations)