Self-Attention with Relative Position Representations
@inproceedings{Shaw2018SelfAttentionWR,
  title     = {Self-Attention with Relative Position Representations},
  author    = {Peter Shaw and Jakob Uszkoreit and Ashish Vaswani},
  booktitle = {NAACL},
  year      = {2018}
}
Relying entirely on an attention mechanism, the Transformer introduced by Vaswani et al. (2017) achieves state-of-the-art results for machine translation. In contrast to recurrent and convolutional neural networks, it does not explicitly model relative or absolute position information in its structure. Instead, it requires adding representations of absolute positions to its inputs. In this work we present an alternative approach, extending the self-attention mechanism to efficiently consider…
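The idea sketched in the abstract, injecting learned embeddings of the clipped relative distance j − i into both the attention logits and the attention-weighted values, can be illustrated with a short single-head NumPy sketch. This is an illustrative reading of the paper's equations, not the authors' implementation; the function and variable names, the clipping distance k, and the toy dimensions are assumptions chosen for the example.

```python
import numpy as np

def relative_attention(x, Wq, Wk, Wv, rel_k, rel_v, k):
    """Single-head self-attention with relative position representations (sketch).

    x:            (n, d) input sequence
    Wq, Wk, Wv:   (d, d) query/key/value projections
    rel_k, rel_v: (2k+1, d) learned embeddings of clipped relative distances
    k:            maximum relative distance (clipping threshold)
    """
    n, d = x.shape
    q, key, v = x @ Wq, x @ Wk, x @ Wv

    # Relative distance j - i, clipped to [-k, k] and shifted to indices [0, 2k].
    idx = np.clip(np.arange(n)[None, :] - np.arange(n)[:, None], -k, k) + k
    a_k = rel_k[idx]  # (n, n, d) key-side relative embeddings a^K_ij
    a_v = rel_v[idx]  # (n, n, d) value-side relative embeddings a^V_ij

    # e_ij = q_i . (k_j + a^K_ij) / sqrt(d), followed by a row-wise softmax
    logits = (q @ key.T + np.einsum('id,ijd->ij', q, a_k)) / np.sqrt(d)
    alpha = np.exp(logits - logits.max(axis=-1, keepdims=True))
    alpha /= alpha.sum(axis=-1, keepdims=True)

    # z_i = sum_j alpha_ij * (v_j + a^V_ij)
    return alpha @ v + np.einsum('ij,ijd->id', alpha, a_v)

# Toy usage with random weights (hypothetical sizes).
rng = np.random.default_rng(0)
n, d, k = 6, 8, 3
x = rng.normal(size=(n, d))
Wq, Wk, Wv = (rng.normal(size=(d, d)) * 0.1 for _ in range(3))
rel_k_emb = rng.normal(size=(2 * k + 1, d)) * 0.1
rel_v_emb = rng.normal(size=(2 * k + 1, d)) * 0.1
print(relative_attention(x, Wq, Wk, Wv, rel_k_emb, rel_v_emb, k).shape)  # (6, 8)
```

Clipping relative distances to ±k keeps the number of learned relative embeddings at 2k + 1, independent of sequence length, which is what makes the mechanism efficient compared to learning a separate embedding per absolute position pair.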
926 Citations
Self-Attention with Structural Position Representations
- Computer Science · EMNLP
- 2019
This work uses a dependency tree to represent the grammatical structure of a sentence and proposes two strategies to encode the positional relationships among words in the dependency tree.
Convolutions and Self-Attention: Re-interpreting Relative Positions in Pre-trained Language Models
- Computer Science · ACL
- 2021
Composite attention is proposed, uniting previous relative position encoding methods under a convolutional framework; convolutions are found to consistently improve performance on multiple downstream tasks when they replace absolute position embeddings.
Rethinking Positional Encoding in Language Pre-training
- Computer Science · ICLR
- 2021
This work investigates the problems in previous formulations and proposes a new positional encoding method for BERT, the Transformer with Untied Positional Encoding (TUPE), which achieves higher scores than baselines while using only 30% of the pre-training computational cost.
On Scalar Embedding of Relative Positions in Attention Models
- Computer Science · AAAI
- 2021
This work shows that scalar relative position embedding (SRPE) in attention has an elegant probabilistic interpretation and proposes a new SRPE (AT5) that adopts a learnable bucketization protocol and automatically adapts to the dependency range specific to the learning task.
Rethinking and Improving Relative Position Encoding for Vision Transformer
- Computer Science · 2021 IEEE/CVF International Conference on Computer Vision (ICCV)
- 2021
New relative position encoding methods dedicated to 2D images, called image RPE (iRPE), are proposed; they consider directional relative distance modeling as well as the interactions between queries and relative position embeddings in the self-attention mechanism.
Improve Transformer Models with Better Relative Position Embeddings
- Computer Science · Findings of EMNLP
- 2020
This paper proposes new methods to encourage increased interaction between query, key, and relative position embeddings in the self-attention mechanism and demonstrates empirically that the relative embedding method generalizes reasonably well to, and is robust in, the inductive setting.
Dynamic Position Encoding for Transformers
- Computer Science · arXiv
- 2022
A novel architecture with position embeddings that depend on the input text, referred to as dynamic position encoding (DPE), is proposed to address this shortcoming of Transformers by taking the order of target words into consideration.
Global Context and Geometric Priors for Effective Non-Local Self-Attention
- Computer Science · BMVC
- 2021
This paper proposes a new relational reasoning module that incorporates a contextualized diagonal matrix and 2D relative position representations, allowing the relational representation of a feature point to encode the whole image context and its relative position information.
Context-Aware Self-Attention Networks
- Computer Science · AAAI
- 2019
This work proposes to contextualize the transformations of the query and key layers, which are used to calculate the relevance between elements, and to leverage internal representations that embed both global and deep contexts, thus avoiding reliance on external resources.
CAPE: Encoding Relative Positions with Continuous Augmented Positional Embeddings
- Computer Science · NeurIPS
- 2021
This paper proposes an augmentation-based approach (CAPE) for absolute positional embeddings, which keeps the advantages of both absolute and relative positions and leads to better generalization performance as well as increased stability with respect to training hyper-parameters.
References
Showing 1–10 of 17 references
Effective Approaches to Attention-based Neural Machine Translation
- Computer Science · EMNLP
- 2015
A global approach which always attends to all source words and a local one that only looks at a subset of source words at a time are examined, demonstrating the effectiveness of both approaches on the WMT translation tasks between English and German in both directions.
Attention is All you Need
- Computer Science · NIPS
- 2017
A new simple network architecture, the Transformer, based solely on attention mechanisms and dispensing with recurrence and convolutions entirely, is proposed; it generalizes well to other tasks, as shown by applying it successfully to English constituency parsing with both large and limited training data.
End-To-End Memory Networks
- Computer Science · NIPS
- 2015
A neural network with a recurrent attention model over a possibly large external memory is trained end-to-end and hence requires significantly less supervision during training, making it more generally applicable in realistic settings.
Graph Attention Networks
- Computer Science · ICLR
- 2018
We present graph attention networks (GATs), novel neural network architectures that operate on graph-structured data, leveraging masked self-attentional layers to address the shortcomings of prior…
Rethinking the Inception Architecture for Computer Vision
- Computer Science · 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)
- 2016
This work explores ways to scale up networks that aim to utilize the added computation as efficiently as possible through suitably factorized convolutions and aggressive regularization.
Google's Neural Machine Translation System: Bridging the Gap between Human and Machine Translation
- Computer Science · arXiv
- 2016
GNMT, Google's Neural Machine Translation system, is presented; it attempts to address many of the weaknesses of conventional phrase-based translation systems and provides a good balance between the flexibility of "character"-delimited models and the efficiency of "word"-delimited models.
Sequence to Sequence Learning with Neural Networks
- Computer Science · NIPS
- 2014
This paper presents a general end-to-end approach to sequence learning that makes minimal assumptions about the sequence structure, and finds that reversing the order of the words in all source sentences markedly improved the LSTM's performance, because doing so introduced many short-term dependencies between the source and the target sentence that made the optimization problem easier.
Neural Machine Translation in Linear Time
- Computer Science · arXiv
- 2016
The ByteNet decoder attains state-of-the-art performance on character-level language modelling and outperforms the previous best results obtained with recurrent networks, and the latent alignment structure contained in the representations reflects the expected alignment between the tokens.
Neural Machine Translation by Jointly Learning to Align and Translate
- Computer Science · ICLR
- 2015
It is conjectured that the use of a fixed-length vector is a bottleneck in improving the performance of this basic encoder-decoder architecture, and it is proposed to extend it by allowing the model to automatically (soft-)search for parts of a source sentence that are relevant to predicting a target word, without having to form these parts as a hard segment explicitly.
Convolutional Sequence to Sequence Learning
- Computer Science · ICML
- 2017
This work introduces an architecture based entirely on convolutional neural networks, which outperforms the accuracy of the deep LSTM setup of Wu et al. (2016) on both WMT'14 English-German and WMT'14 English-French translation, at an order of magnitude faster speed on both GPU and CPU.