• Corpus ID: 237940517

Multiplicative Position-aware Transformer Models for Language Understanding

Zhiheng Huang, Davis Liang, Peng Xu, Bing Xiang
In order to utilize positional ordering information in transformer models, various flavors of absolute and relative position embeddings have been proposed. However, there is no comprehensive comparison of position embedding methods in the literature. In this paper, we review existing position embedding methods and compare their accuracy on downstream NLP tasks, using our own implementations. We also propose a novel multiplicative embedding method…


Improve Transformer Models with Better Relative Position Embeddings

This paper proposes new methods to encourage increased interaction between query, key and relative position embeddings in the self-attention mechanism and demonstrates empirically that the relative embedding method generalizes reasonably well and is robust in the inductive setting.

Rethinking Positional Encoding in Language Pre-training

This work investigates the problems in the previous formulations and proposes a new positional encoding method for BERT called Transformer with Untied Positional Encoding (TUPE), which can achieve a higher score than baselines while using only 30% of the pre-training computational cost.

Self-Attention with Relative Position Representations

This work presents an alternative approach, extending the self-attention mechanism to efficiently consider representations of the relative positions, or distances between sequence elements, and evaluates it on the WMT 2014 English-to-German and English-to-French translation tasks.

BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

A new language representation model, BERT, designed to pre-train deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers, which can be fine-tuned with just one additional output layer to create state-of-the-art models for a wide range of tasks.

MPNet: Masked and Permuted Pre-training for Language Understanding

This paper proposes MPNet, a novel pre-training method that inherits the advantages of BERT and XLNet and avoids their limitations, and achieves better results on these tasks compared with previous state-of-the-art pre-trained methods.

Transformer-XL: Attentive Language Models beyond a Fixed-Length Context

This work proposes a novel neural architecture Transformer-XL that enables learning dependency beyond a fixed length without disrupting temporal coherence, which consists of a segment-level recurrence mechanism and a novel positional encoding scheme.

Position Information in Transformers: An Overview

An overview and theoretical comparison of existing methods to incorporate position information into Transformer models is provided and what characteristics of an application should be taken into account when selecting a position encoding is indicated.

An Improved Relative Self-Attention Mechanism for Transformer with Application to Music Generation

In experiments on symbolic music, relative self-attention substantially improves sample quality for unconditioned generation and is able to generate sequences of lengths longer than those from the training set, making it possible to train much longer sequences and achieve faster convergence.

ALBERT: A Lite BERT for Self-supervised Learning of Language Representations

This work presents two parameter-reduction techniques to lower memory consumption and increase the training speed of BERT, and uses a self-supervised loss that focuses on modeling inter-sentence coherence.

DeBERTa: Decoding-enhanced BERT with Disentangled Attention

A new model architecture DeBERTa (Decoding-enhanced BERT with disentangled attention) is proposed that improves the BERT and RoBERTa models using two novel techniques that significantly improve the efficiency of model pre-training and performance of downstream tasks.