Position Information in Transformers: An Overview

@article{dufter2022position,
  title={Position Information in Transformers: An Overview},
  author={Philipp Dufter and Martin Schmitt and Hinrich Sch{\"u}tze},
  journal={Computational Linguistics},
  year={2022}
}
Abstract: Transformers are arguably the main workhorse in recent natural language processing research. By definition, a Transformer is invariant with respect to reordering of the input. However, language is inherently sequential and word order is essential to the semantics and syntax of an utterance. In this article, we provide an overview and theoretical comparison of existing methods to incorporate position information into Transformer models. The objectives of this survey are to (1) showcase…
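The order invariance the survey refers to can be verified directly: a single self-attention head without position information is permutation-equivariant, so permuting the input tokens just permutes the output rows. A minimal NumPy sketch (not taken from the survey; all names here are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

def attention(X, Wq, Wk, Wv):
    """Single-head dot-product self-attention with no position information."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    # Row-wise softmax over attention scores
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

d = 8
X = rng.normal(size=(5, d))                     # 5 tokens, d-dim embeddings
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
perm = rng.permutation(5)

# Permuting the input rows permutes the output rows identically:
out = attention(X, Wq, Wk, Wv)
out_perm = attention(X[perm], Wq, Wk, Wv)
assert np.allclose(out[perm], out_perm)
```

This is exactly why the position-encoding methods listed below are needed: without them, the model cannot distinguish any two orderings of the same tokens.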

Multiplicative Position-aware Transformer Models for Language Understanding

It is shown that the proposed embedding method, which serves as a drop-in replacement for the default absolute position embedding, can improve the RoBERTa-base and RoBERTa-large models on the SQuAD 1.1 and SQuAD 2.0 datasets.

The Impact of Positional Encodings on Multilingual Compression

Although sinusoidal positional encodings were designed for monolingual applications, they prove particularly useful in multilingual language models, because they facilitate compositionality by allowing linear projections over arbitrary time steps.
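The "linear projections over arbitrary time steps" property can be made concrete: for any fixed offset k, the sinusoidal encoding of position pos + k is a linear function of the encoding of pos, since each (sin, cos) pair is rotated by an angle that depends only on k. A small NumPy sketch of this well-known property (function names are my own):

```python
import numpy as np

def sinusoidal_pe(pos, d_model):
    """Sinusoidal position encoding as in the original Transformer."""
    i = np.arange(d_model // 2)
    freqs = 1.0 / (10000 ** (2 * i / d_model))
    pe = np.empty(d_model)
    pe[0::2] = np.sin(pos * freqs)
    pe[1::2] = np.cos(pos * freqs)
    return pe

def shift_matrix(k, d_model):
    """Block-diagonal rotation matrix M with M @ PE(pos) == PE(pos + k)."""
    i = np.arange(d_model // 2)
    freqs = 1.0 / (10000 ** (2 * i / d_model))
    M = np.zeros((d_model, d_model))
    for j, w in enumerate(freqs):
        c, s = np.cos(k * w), np.sin(k * w)
        M[2*j:2*j+2, 2*j:2*j+2] = [[c, s], [-s, c]]
    return M

d = 8
pe5 = sinusoidal_pe(5, d)
pe9 = sinusoidal_pe(9, d)
assert np.allclose(shift_matrix(4, d) @ pe5, pe9)  # PE(5) shifted by 4 gives PE(9)
```

Crucially, the shift matrix depends only on the offset k, never on the absolute position, which is what enables relative reasoning on top of absolute encodings.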

Perspectives and Prospects on Transformer Architecture for Cross-Modal Tasks with Language and Vision

Some of the most critical milestones in the field of transformer architecture, as well as overall trends on how transformer architecture has been incorporated into visuolinguistic cross-modal tasks are reviewed.

Frustratingly Easy Performance Improvements for Low-resource Setups: A Tale on BERT and Segment Embeddings

It is found that the default setting for the most used multilingual BERT model underperforms heavily, and a simple swap of the segment embeddings yields an average improvement of 2.5 points absolute LAS score for dependency parsing over 9 different treebanks.

CAPE: Encoding Relative Positions with Continuous Augmented Positional Embeddings

This paper proposes an augmentation-based approach (CAPE) for absolute positional embeddings, which keeps the advantages of both absolute and relative positions and leads to better generalization performance as well as increased stability with respect to training hyper-parameters.

SHAPE: Shifted Absolute Position Embedding for Transformers

Shifted absolute position embedding (SHAPE) is investigated to achieve shift invariance, which is a key property of recent successful position representations, by randomly shifting absolute positions during training.
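The core trick of SHAPE, as described in the summary above, is easy to sketch: during training, every absolute position index in a sequence is offset by one random shift, so the model can only exploit differences between positions. A minimal illustration under my own naming, not the authors' code:

```python
import numpy as np

def shape_positions(seq_len, max_shift, training=True, rng=None):
    """SHAPE-style position indices: during training, shift all absolute
    positions by a single random offset k so the model cannot rely on
    exact absolute positions, only on their differences."""
    rng = rng or np.random.default_rng()
    k = rng.integers(0, max_shift + 1) if training else 0
    return np.arange(seq_len) + k

positions = shape_positions(10, max_shift=50)
# Relative differences are unaffected by the shift:
assert np.all(np.diff(positions) == 1)
```

At inference time (training=False) the shift is simply dropped, so no architectural change is needed relative to standard absolute position embeddings.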

TrueType Transformer: Character and Font Style Recognition in Outline Format

The applicability of T³ (TrueType Transformer) is experimentally shown in character and font style recognition tasks, while observing how the individual control points contribute to the classification results.

Position Prediction as an Effective Pretraining Strategy

This paper proposes a novel, but surprisingly simple alternative to content reconstruction – that of predicting locations from content, without providing positional information for it, which enables Transformers trained without position embeddings to outperform ones trained with full position information.

Does chronology matter? Sequential vs contextual approaches to knowledge tracing

A knowledge tracing model based on a general Transformer encoder architecture is designed to explore the predictive power of sequentiality for attention-based models, shedding light on the benefits and challenges of sequential modeling in student performance prediction.

Recent Advances in Neural Text Generation: A Task-Agnostic Survey

A task-agnostic survey of recent advances in neural text generation is presented, grouped under four headings: data construction, neural frameworks, training and inference strategies, and evaluation metrics.

Demystifying the Better Performance of Position Encoding Variants for Transformer

This work demonstrates a simple yet effective way to encode position and segment into the Transformer models and performs on par with SOTA on GLUE, XTREME and WMT benchmarks while saving computation costs.

An Augmented Transformer Architecture for Natural Language Generation Tasks

An augmented Transformer architecture is encoded with additional linguistic knowledge, such as Part-of-Speech (POS) tags, to boost performance on natural language generation tasks, e.g., automatic translation and summarization.

Rethinking Positional Encoding in Language Pre-training

This work investigates the problems in the previous formulations and proposes a new positional encoding method for BERT called Transformer with Untied Positional Encoding (TUPE), which can achieve a higher score than baselines while only using 30% pre-training computational costs.

On the Relation between Position Information and Sentence Length in Neural Machine Translation

This study focuses on the type of position information used in NMT models, hypothesizes that relative position is better than absolute position, and proposes RNN-Transformer, which replaces the positional encoding layer of the Transformer with an RNN; the RNN-based model is then compared with four variants of the Transformer.

RoFormer: Enhanced Transformer with Rotary Position Embedding

This paper investigates various methods to integrate positional information into the learning process of transformer-based language models and proposes a novel method named Rotary Position Embedding (RoPE), which encodes the absolute position with a rotation matrix while incorporating explicit relative position dependency into the self-attention formulation.
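The rotation described in the summary above can be sketched in a few lines: RoPE rotates each consecutive pair of query/key dimensions by an angle proportional to the absolute position, and the resulting dot product then depends only on the relative offset. A minimal NumPy illustration (my own simplified formulation, not the authors' implementation):

```python
import numpy as np

def rope(x, pos, base=10000.0):
    """Apply a rotary position embedding to vector x at position `pos`:
    each consecutive pair of dimensions is rotated by pos * theta_i."""
    d = x.shape[-1]
    theta = base ** (-np.arange(0, d, 2) / d)   # per-pair frequencies
    angles = pos * theta
    c, s = np.cos(angles), np.sin(angles)
    x1, x2 = x[0::2], x[1::2]
    out = np.empty_like(x)
    out[0::2] = x1 * c - x2 * s
    out[1::2] = x1 * s + x2 * c
    return out

# Key property: the dot product of rotated queries/keys depends only on
# the relative offset, not on the absolute positions themselves.
rng = np.random.default_rng(0)
q, k = rng.normal(size=8), rng.normal(size=8)
a = rope(q, 3) @ rope(k, 7)        # offset 4 at positions 3 and 7
b = rope(q, 103) @ rope(k, 107)    # same offset 4, shifted by 100
assert np.isclose(a, b)
```

This is why RoPE is often described as encoding absolute positions while inducing relative behavior in attention scores.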

Learning to Encode Position for Transformer with Continuous Dynamical Model

A new way of learning to encode position information for non-recurrent models, such as Transformer models, is introduced, borrowing from the recent Neural ODE approach, which may be viewed as a versatile continuous version of a ResNet.

Novel positional encodings to enable tree-based transformers

This work abstracts the transformer's sinusoidal positional encodings, allowing it to instead use a novel positional encoding scheme to represent node positions within trees, achieving superior performance over both sequence-to-sequence transformers and state-of-the-art tree-based LSTMs on several datasets.

Synthesizer: Rethinking Self-Attention in Transformer Models

The true importance and contribution of the dot product-based self-attention mechanism on the performance of Transformer models is investigated and a model that learns synthetic attention weights without token-token interactions is proposed, called Synthesizer.

Analysis of Positional Encodings for Neural Machine Translation

This work proposes and analyzes variations of relative positional encoding and observes that the number of trainable parameters can be reduced without a performance loss, by using fixed encoding vectors or by removing some of the positional encoding vectors.
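A common starting point for the relative positional encodings analyzed in such work is a table of clipped relative distances, where attention between positions i and j looks up an embedding indexed by j - i, clipped to a maximum distance. A small sketch of that indexing scheme (names are illustrative, not from the paper):

```python
import numpy as np

def relative_position_index(seq_len, max_distance=4):
    """Clipped relative position indices: entry (i, j) is j - i, clipped
    to [-max_distance, max_distance] and shifted to a non-negative index
    into a (2 * max_distance + 1)-row embedding table."""
    pos = np.arange(seq_len)
    rel = pos[None, :] - pos[:, None]
    rel = np.clip(rel, -max_distance, max_distance)
    return rel + max_distance

R = relative_position_index(6, max_distance=4)
# The diagonal (distance 0) always maps to the middle table row:
assert np.all(np.diag(R) == 4)
```

Clipping is what keeps the number of trainable position vectors small and independent of sequence length, which is the parameter-count knob the paper's analysis turns.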

Self-Attention with Structural Position Representations

This work uses a dependency tree to represent the grammatical structure of a sentence, and proposes two strategies to encode the positional relationships among words in that tree.