Improve Transformer Models with Better Relative Position Embeddings

Zhiheng Huang, Davis Liang, Peng Xu, Bing Xiang
The transformer model has demonstrated superior results on NLP tasks including machine translation and question answering. In this paper, we argue that position information is not fully utilized in existing work. For example, the initially proposed sinusoid embedding is fixed and not learnable. We first review absolute position embeddings and existing relative position embedding methods. We then propose new methods to encourage increased interaction between query, key…

Multiplicative Position-aware Transformer Models for Language Understanding

It is shown that the proposed embedding method, which serves as a drop-in replacement for the default absolute position embedding, can improve the RoBERTa-base and RoBERTa-large models on the SQuAD1.1 and SQuAD2.0 datasets.

CAPE: Encoding Relative Positions with Continuous Augmented Positional Embeddings

This paper proposes an augmentation-based approach (CAPE) for absolute positional embeddings, which keeps the advantages of both absolute and relative positions and leads to better generalization performance as well as increased stability with respect to training hyper-parameters.

RoFormer: Enhanced Transformer with Rotary Position Embedding

This paper investigates various methods to integrate positional information into the learning process of transformer-based language models and proposes a novel method named Rotary Position Embedding (RoPE), which encodes the absolute position with a rotation matrix and meanwhile incorporates the explicit relative position dependency in self-attention formulation.
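The rotation idea in this summary can be illustrated with a minimal sketch. It assumes the common formulation in which consecutive pairs of dimensions are rotated by an angle proportional to the token position; the function names, the `base` value, and the toy vectors are illustrative assumptions, not taken from the paper's code. The point is that the dot product of a rotated query and a rotated key depends only on their relative offset.

```python
import math

def rotate(vec, position, base=10000.0):
    """Rotate each pair (vec[i], vec[i+1]) by an angle proportional to position.

    Assumed frequency schedule: base ** (-i / dim) for even index i.
    """
    dim = len(vec)
    assert dim % 2 == 0, "dimension must be even so it can be split into pairs"
    out = []
    for i in range(0, dim, 2):
        theta = position * base ** (-i / dim)
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out.extend([x * c - y * s, x * s + y * c])
    return out

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

# Toy query and key vectors (hypothetical values).
q = [0.3, -1.2, 0.7, 0.5]
k = [1.1, 0.4, -0.6, 0.9]

# Two position pairs with the same relative offset (key is 3 steps before query).
score_a = dot(rotate(q, 5), rotate(k, 2))
score_b = dot(rotate(q, 13), rotate(k, 10))

# A pair with a different offset, for contrast.
score_c = dot(rotate(q, 5), rotate(k, 4))
```

Here `score_a` and `score_b` agree up to floating-point error although the absolute positions differ, while `score_c` generally differs because the offset changed.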

Rethinking and Improving Relative Position Encoding for Vision Transformer

New relative position encoding methods dedicated to 2D images, called image RPE (iRPE), are proposed, which consider directional relative distance modeling as well as the interactions between queries and relative position embeddings in self-attention mechanism.

The Impact of Positional Encodings on Multilingual Compression

While sinusoidal positional encodings were designed for monolingual applications, they are particularly useful in multilingual language models, because they were explicitly designed to facilitate compositionality by allowing linear projections over arbitrary time steps.
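The "linear projections over arbitrary time steps" property can be checked numerically. The sketch below assumes the standard interleaved sine/cosine layout of sinusoidal encodings; `sinusoid` and `shift` are hypothetical helper names. It shows that the encoding of position `pos + offset` is obtained from the encoding of `pos` by a linear map that depends only on `offset`.

```python
import math

def sinusoid(pos, dim, base=10000.0):
    """Sinusoidal encoding: even slots hold sin, odd slots hold cos."""
    pe = []
    for i in range(0, dim, 2):
        angle = pos * base ** (-i / dim)
        pe.extend([math.sin(angle), math.cos(angle)])
    return pe

def shift(pe, offset, base=10000.0):
    """Apply the position-independent linear map taking PE(pos) to PE(pos + offset).

    Uses the angle-addition identities on each (sin, cos) pair.
    """
    dim = len(pe)
    out = []
    for i in range(0, dim, 2):
        angle = offset * base ** (-i / dim)
        c, s = math.cos(angle), math.sin(angle)
        sin_p, cos_p = pe[i], pe[i + 1]
        out.extend([sin_p * c + cos_p * s, cos_p * c - sin_p * s])
    return out

shifted = shift(sinusoid(7, 8), 5)   # map PE(7) forward by 5 steps
direct = sinusoid(12, 8)             # compute PE(12) directly
max_error = max(abs(a - b) for a, b in zip(shifted, direct))
```

`max_error` is at floating-point noise level, since `shift` never looks at the absolute position, only at the offset.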

Explore Better Relative Position Embeddings from Encoding Perspective for Transformer Models

This paper investigates the potential problems in Shaw-RPE and XL-RPE, and proposes two novel RPEs called Low-level Fine-grained High-level Coarse-grained (LFHC) RPE and Gaussian Cumulative Distribution Function (GCDF) RPE.

Parameterization of Cross-Token Relations with Relative Positional Encoding for Vision MLP

A new positional spatial gating unit (PoSGU) is proposed that exploits the attention formulations used in classical relative positional encoding (RPE) to efficiently encode the cross-token relations for token mixing, and serves as the key building block of a new type of vision MLP, referred to as PosMLP.

TANet: Thread-Aware Pretraining for Abstractive Conversational Summarization

It is argued that the inherent contextual dependency among the utterances plays an essential role in understanding the entire conversation, and a thread-aware Transformer-based network, TANet, is proposed, which achieves a new state of the art in terms of both automatic evaluation and human judgment.

KERPLE: Kernelized Relative Positional Embedding for Length Extrapolation

KERPLE, a framework that generalizes relative position embedding for extrapolation by kernelizing positional differences, is proposed using conditionally positive definite (CPD) kernels, and it is shown that a CPD kernel can be transformed into a PD kernel by adding a constant offset.

Disentangled Sequence to Sequence Learning for Compositional Generalization

An extension to sequence-to-sequence models which encourages disentanglement by adaptively re-encoding (at each time step) the source input, conditioning the source representations on the newly decoded target context, which makes it easier for the encoder to exploit specialized information for each prediction.



Self-Attention with Relative Position Representations

This work presents an alternative approach, extending the self-attention mechanism to efficiently consider representations of the relative positions, or distances between sequence elements, on the WMT 2014 English-to-German and English-to-French translation tasks.
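A minimal sketch of this kind of relative-position attention follows. It assumes the formulation in which a learned embedding, indexed by the clipped distance `j - i`, is added to each key before the scaled dot product; the names `clip`, `relative_attention_logits`, and `rel_key_table`, and all numeric values, are illustrative assumptions rather than the paper's implementation.

```python
import math

def clip(distance, max_distance):
    """Clamp a signed distance to the range [-max_distance, max_distance]."""
    return max(-max_distance, min(max_distance, distance))

def relative_attention_logits(queries, keys, rel_key_table, max_distance):
    """Compute logits[i][j] = q_i . (k_j + a_ij) / sqrt(d).

    rel_key_table holds 2 * max_distance + 1 vectors; a_ij is the entry
    for the clipped distance j - i (assumed layout: index 0 is -max_distance).
    """
    d = len(queries[0])
    logits = []
    for i, q in enumerate(queries):
        row = []
        for j, k in enumerate(keys):
            a = rel_key_table[clip(j - i, max_distance) + max_distance]
            score = sum(qx * (kx + ax) for qx, kx, ax in zip(q, k, a))
            row.append(score / math.sqrt(d))
        logits.append(row)
    return logits

# Toy setup: six identical tokens, so any variation in the logits
# comes from the relative position term alone.
max_distance = 2
token = [0.5, -1.0, 0.25, 2.0]
queries = [token] * 6
keys = [token] * 6
# Hypothetical "learned" table: one distinct vector per clipped distance.
rel_key_table = [[0.1 * (r + 1)] * 4 for r in range(2 * max_distance + 1)]

logits = relative_attention_logits(queries, keys, rel_key_table, max_distance)
```

With identical tokens, `logits[1][2]` equals `logits[3][4]` (both at distance +1), and `logits[0][3]` equals `logits[0][5]` because distances beyond `max_distance` share one clipped embedding.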

An Improved Relative Self-Attention Mechanism for Transformer with Application to Music Generation

In experiments on symbolic music, relative self-attention substantially improves sample quality for unconditioned generation and is able to generate sequences of lengths longer than those from the training set, making it possible to train much longer sequences and achieve faster convergence.

BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

BERT is a new language representation model designed to pre-train deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers; it can be fine-tuned with just one additional output layer to create state-of-the-art models for a wide range of tasks.

MPNet: Masked and Permuted Pre-training for Language Understanding

This paper proposes MPNet, a novel pre-training method that inherits the advantages of BERT and XLNet and avoids their limitations, and achieves better results on these tasks compared with previous state-of-the-art pre-trained methods.

Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer

This systematic study compares pre-training objectives, architectures, unlabeled datasets, transfer approaches, and other factors on dozens of language understanding tasks and achieves state-of-the-art results on many benchmarks covering summarization, question answering, text classification, and more.

ALBERT: A Lite BERT for Self-supervised Learning of Language Representations

This work presents two parameter-reduction techniques to lower memory consumption and increase the training speed of BERT, and uses a self-supervised loss that focuses on modeling inter-sentence coherence.

Attention is All you Need

A new simple network architecture, the Transformer, is proposed; it is based solely on attention mechanisms, dispensing with recurrence and convolutions entirely, and generalizes well to other tasks, as shown by applying it successfully to English constituency parsing with both large and limited training data.

Multi-Task Deep Neural Networks for Natural Language Understanding

A Multi-Task Deep Neural Network (MT-DNN) is presented for learning representations across multiple natural language understanding (NLU) tasks; it allows domain adaptation with substantially fewer in-domain labels than the pre-trained BERT representations.

XLNet: Generalized Autoregressive Pretraining for Language Understanding

XLNet is proposed, a generalized autoregressive pretraining method that enables learning bidirectional contexts by maximizing the expected likelihood over all permutations of the factorization order and overcomes the limitations of BERT thanks to its autoregressive formulation.

RoBERTa: A Robustly Optimized BERT Pretraining Approach

It is found that BERT was significantly undertrained, and can match or exceed the performance of every model published after it, and the best model achieves state-of-the-art results on GLUE, RACE and SQuAD.