Self-Attention with Relative Position Representations

@inproceedings{Shaw2018SelfAttentionWR,
  title={Self-Attention with Relative Position Representations},
  author={Peter Shaw and Jakob Uszkoreit and Ashish Vaswani},
  booktitle={NAACL},
  year={2018}
}
Relying entirely on an attention mechanism, the Transformer introduced by Vaswani et al. (2017) achieves state-of-the-art results for machine translation. In contrast to recurrent and convolutional neural networks, it does not explicitly model relative or absolute position information in its structure. Instead, it requires adding representations of absolute positions to its inputs. In this work we present an alternative approach, extending the self-attention mechanism to efficiently consider… 

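To make the truncated description concrete, below is a minimal single-head sketch of self-attention with relative position representations in the spirit of the paper's formulation: learned embeddings for clipped relative distances are added to the keys and values before computing attention. The tensor shapes, the clipping distance k=4, and the function name relative_self_attention are illustrative assumptions, not the authors' reference implementation.

```python
# Minimal single-head sketch of self-attention with relative position
# representations: learned embeddings a^K, a^V for clipped relative
# distances are added to keys and values (illustrative assumptions, not
# the authors' reference code).
import torch
import torch.nn.functional as F

def relative_self_attention(x, Wq, Wk, Wv, rel_k, rel_v, k=4):
    """x: (n, d_x); Wq/Wk/Wv: (d_x, d_z); rel_k/rel_v: (2k+1, d_z)."""
    n, _ = x.shape
    d_z = Wq.shape[1]

    q = x @ Wq                       # queries,  (n, d_z)
    key = x @ Wk                     # keys,     (n, d_z)
    v = x @ Wv                       # values,   (n, d_z)

    # Relative distances j - i, clipped to [-k, k] and shifted to indices 0..2k
    pos = torch.arange(n)
    rel = torch.clamp(pos[None, :] - pos[:, None], -k, k) + k   # (n, n)
    a_k = rel_k[rel]                 # relative key embeddings,   (n, n, d_z)
    a_v = rel_v[rel]                 # relative value embeddings, (n, n, d_z)

    # e_ij = q_i . (k_j + a^K_ij) / sqrt(d_z)
    scores = (q @ key.T + torch.einsum('id,ijd->ij', q, a_k)) / d_z ** 0.5
    alpha = F.softmax(scores, dim=-1)                            # (n, n)

    # z_i = sum_j alpha_ij (v_j + a^V_ij)
    return alpha @ v + torch.einsum('ij,ijd->id', alpha, a_v)

# Usage with random parameters (illustrative only):
torch.manual_seed(0)
n, d_x, d_z, k = 6, 16, 8, 4
x = torch.randn(n, d_x)
out = relative_self_attention(
    x, torch.randn(d_x, d_z), torch.randn(d_x, d_z), torch.randn(d_x, d_z),
    torch.randn(2 * k + 1, d_z), torch.randn(2 * k + 1, d_z), k=k)
print(out.shape)  # torch.Size([6, 8])
```

Clipping the relative distance to a maximum of k keeps the number of learned relative embeddings at 2k+1, independent of sequence length.
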
Citations

Self-Attention with Structural Position Representations
TLDR
This work uses the dependency tree to represent the grammatical structure of a sentence and proposes two strategies to encode the positional relationships among words in the dependency tree.
Convolutions and Self-Attention: Re-interpreting Relative Positions in Pre-trained Language Models
TLDR
Composite attention is proposed, uniting previous relative position encoding methods under a convolutional framework; convolutions are found to consistently improve performance on multiple downstream tasks when replacing absolute position embeddings.
Rethinking Positional Encoding in Language Pre-training
TLDR
This work investigates the problems in previous formulations and proposes a new positional encoding method for BERT called Transformer with Untied Positional Encoding (TUPE), which can achieve a higher score than baselines while using only 30% of the pre-training computational cost.
On Scalar Embedding of Relative Positions in Attention Models
TLDR
This work shows that SRPE in attention has an elegant probabilistic interpretation and proposes a new SRPE (AT5) that adopts a learnable bucketization protocol and automatically adapts to the dependency range specific to the learning task.
Rethinking and Improving Relative Position Encoding for Vision Transformer
TLDR
New relative position encoding methods dedicated to 2D images, called image RPE (iRPE), are proposed, which consider directional relative distance modeling as well as the interactions between queries and relative position embeddings in the self-attention mechanism.
Improve Transformer Models with Better Relative Position Embeddings
TLDR
This paper proposes new methods to encourage increased interaction between query, key and relative position embeddings in the self-attention mechanism, and demonstrates empirically that the relative embedding method generalizes reasonably to, and is robust in, the inductive setting.
Dynamic Position Encoding for Transformers
TLDR
A novel architecture with position embeddings that depend on the input text, addressing a shortcoming of Transformers by taking the order of target words into consideration; the approach is referred to as dynamic position encoding (DPE).
Global Context and Geometric Priors for Effective Non-Local Self-Attention
TLDR
This paper proposes a new relational reasoning module that incorporates a contextualized diagonal matrix and 2D relative position representations, allowing the relational representation of a feature point to encode the whole image context and its relative position information.
Context-Aware Self-Attention Networks
TLDR
This work proposes to contextualize the transformations of the query and key layers, which are used to calculate the relevance between elements, and to leverage internal representations that embed both global and deep contexts, thus avoiding reliance on external resources.
CAPE: Encoding Relative Positions with Continuous Augmented Positional Embeddings
TLDR
This paper proposes an augmentation-based approach (CAPE) for absolute positional embeddings, which keeps the advantages of both absolute and relative positions and leads to better generalization performance as well as increased stability with respect to training hyper-parameters.
...

References

Showing 1-10 of 17 references
Effective Approaches to Attention-based Neural Machine Translation
TLDR
A global approach which always attends to all source words and a local one that only looks at a subset of source words at a time are examined, demonstrating the effectiveness of both approaches on the WMT translation tasks between English and German in both directions.
Attention is All you Need
TLDR
A new simple network architecture, the Transformer, based solely on attention mechanisms and dispensing with recurrence and convolutions entirely, is proposed; it generalizes well to other tasks, as shown by applying it successfully to English constituency parsing with both large and limited training data.
End-To-End Memory Networks
TLDR
A neural network with a recurrent attention model over a possibly large external memory that is trained end-to-end, and hence requires significantly less supervision during training, making it more generally applicable in realistic settings.
Graph Attention Networks
We present graph attention networks (GATs), novel neural network architectures that operate on graph-structured data, leveraging masked self-attentional layers to address the shortcomings of prior methods based on graph convolutions or their approximations.
Rethinking the Inception Architecture for Computer Vision
TLDR
This work explores ways to scale up networks that aim at utilizing the added computation as efficiently as possible through suitably factorized convolutions and aggressive regularization.
Google's Neural Machine Translation System: Bridging the Gap between Human and Machine Translation
TLDR
GNMT, Google's Neural Machine Translation system, is presented, which attempts to address many of the weaknesses of conventional phrase-based translation systems and provides a good balance between the flexibility of "character"-delimited models and the efficiency of "word"-delimited models.
Sequence to Sequence Learning with Neural Networks
TLDR
This paper presents a general end-to-end approach to sequence learning that makes minimal assumptions on the sequence structure, and finds that reversing the order of the words in all source sentences improved the LSTM's performance markedly, because doing so introduced many short term dependencies between the source and the target sentence which made the optimization problem easier.
Neural Machine Translation in Linear Time
TLDR
The ByteNet decoder attains state-of-the-art performance on character-level language modelling and outperforms the previous best results obtained with recurrent networks; the latent alignment structure contained in the representations reflects the expected alignment between the tokens.
Neural Machine Translation by Jointly Learning to Align and Translate
TLDR
It is conjectured that the use of a fixed-length vector is a bottleneck in improving the performance of this basic encoder-decoder architecture, and it is proposed to extend this by allowing a model to automatically (soft-)search for parts of a source sentence that are relevant to predicting a target word, without having to form these parts as a hard segment explicitly.
Convolutional Sequence to Sequence Learning
TLDR
This work introduces an architecture based entirely on convolutional neural networks, which outperforms the accuracy of the deep LSTM setup of Wu et al. (2016) on both WMT'14 English-German and WMT'14 English-French translation at an order of magnitude faster speed, both on GPU and CPU.
...