Corpus ID: 214605593

Learning to Encode Position for Transformer with Continuous Dynamical Model

@article{Liu2020LearningTE,
  title={Learning to Encode Position for Transformer with Continuous Dynamical Model},
  author={Xuanqing Liu and Hsiang-Fu Yu and Inderjit S. Dhillon and Cho-Jui Hsieh},
  journal={ArXiv},
  year={2020},
  volume={abs/2003.09229}
}
We introduce a new way of learning to encode position information for non-recurrent models, such as Transformer models. Unlike RNNs and LSTMs, which carry an inductive bias by consuming the input tokens sequentially, non-recurrent models are less sensitive to position. The main reason is that position information among input units is not inherently encoded, i.e., the models are permutation equivariant; this problem justifies why all of the existing models are accompanied by a sinusoidal encoding…
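The sinusoidal encoding mentioned at the end of the abstract is the fixed scheme of Vaswani et al. (2017). Below is a minimal NumPy sketch of that baseline, shown only to make the point of comparison concrete; the function name and the assumption of an even model dimension are ours, not from the paper.

import numpy as np

def sinusoidal_position_encoding(seq_len, d_model):
    """Return a (seq_len, d_model) array of fixed sin/cos position codes (d_model assumed even)."""
    positions = np.arange(seq_len)[:, None]                   # (seq_len, 1)
    dims = np.arange(0, d_model, 2)[None, :]                   # (1, d_model/2)
    angles = positions / np.power(10000.0, dims / d_model)     # (seq_len, d_model/2)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

# Usage: add the encoding to the token embeddings before the first attention layer,
# e.g. x = token_embeddings + sinusoidal_position_encoding(seq_len, d_model)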
Continuous Self-Attention Models with Neural ODE Networks
TLDR
A lightweight architecture named Continuous Self-Attention models with neural ODE networks (CSAODE) is proposed; it outperforms state-of-the-art models on text classification tasks and achieves competitive performance on NLI and text matching tasks as well.
Relative Positional Encoding for Transformers with Linear Complexity
TLDR
Stochastic Positional Encoding is presented as a way to generate PE that can be used as a replacement for the classical additive (sinusoidal) PE and provably behaves like relative positional encoding (RPE).
The Case for Translation-Invariant Self-Attention in Transformer-Based Language Models
TLDR
Analysis of the position embeddings of existing language models finds strong evidence of translation invariance, motivating translation-invariant self-attention (TISA), which accounts for the relative position between tokens in an interpretable fashion without needing conventional position embeddings.
On the Ability of Self-Attention Networks to Recognize Counter Languages
TLDR
This work systematically studies the ability of Transformers to model such counter languages, the role of the model's individual components in doing so, and the influence of positional encoding schemes on its learning and generalization ability.
On Position Embeddings in BERT
TLDR
The first formal and quantitative analysis of desiderata for position embeddings (PEs) is contributed, along with a principled discussion of how those desiderata correlate with performance on typical downstream tasks.
Learnable Fourier Features for Multi-Dimensional Spatial Positional Encoding
TLDR
This paper proposes a novel positional encoding method based on learnable Fourier features: each position, which can be multi-dimensional, is represented as a trainable encoding built from a learnable Fourier feature mapping and modulated with a multi-layer perceptron.
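A hedged sketch of the idea described in this entry, under our own assumptions about shapes and parameter names: a (possibly multi-dimensional) position is projected through a learnable Fourier feature map and then modulated by a small MLP. The weights would be trainable in the real model; here they are random placeholders.

import numpy as np

rng = np.random.default_rng(0)
pos_dim, n_features, hidden, d_model = 2, 32, 64, 128

W_r = rng.normal(size=(n_features, pos_dim))       # Fourier projection, trainable in practice
W_1 = rng.normal(size=(2 * n_features, hidden))    # MLP weights, trainable in practice
W_2 = rng.normal(size=(hidden, d_model))

def fourier_position_encoding(pos):
    """pos: (batch, pos_dim) array of positions -> (batch, d_model) encodings."""
    proj = pos @ W_r.T                                        # (batch, n_features)
    feats = np.concatenate([np.cos(proj), np.sin(proj)], -1)  # (batch, 2*n_features)
    feats /= np.sqrt(n_features)
    return np.maximum(feats @ W_1, 0.0) @ W_2                 # ReLU MLP modulation

print(fourier_position_encoding(np.array([[3.0, 5.0]])).shape)  # (1, 128)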
Hierarchical RNNs-Based Transformers MADDPG for Mixed Cooperative-Competitive Environments
TLDR
A hierarchical coding method is applied and its effectiveness validated, and a hierarchical Transformer MADDPG based on RNNs, called Hierarchical RNNs-Based Transformers MADDPG (HRTMADDPG), is proposed.
Conformer-based End-to-end Speech Recognition With Rotary Position Embedding
TLDR
This work investigates various position embedding methods in the convolution-augmented Transformer (Conformer) and adopts a novel implementation named RoPE, which encodes absolute positional information into the input sequence with a rotation matrix and then naturally incorporates explicit relative position information into the self-attention module.
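The rotation described in this entry can be sketched as follows. This is a generic illustration of rotary position embedding (RoPE), not the cited authors' code; the interleaved pairing of dimensions and the base of 10000 are common conventions we assume here. Each pair of feature dimensions of a query or key vector is rotated by an angle proportional to the token's absolute position, so that dot products between rotated queries and keys depend only on relative position.

import numpy as np

def apply_rope(x, base=10000.0):
    """x: (seq_len, d) with d even -> rotated copy of x."""
    seq_len, d = x.shape
    inv_freq = 1.0 / base ** (np.arange(0, d, 2) / d)          # (d/2,) per-pair frequencies
    theta = np.arange(seq_len)[:, None] * inv_freq[None, :]    # (seq_len, d/2) rotation angles
    cos, sin = np.cos(theta), np.sin(theta)
    x1, x2 = x[:, 0::2], x[:, 1::2]
    out = np.empty_like(x)
    out[:, 0::2] = x1 * cos - x2 * sin
    out[:, 1::2] = x1 * sin + x2 * cos
    return out

# q, k = apply_rope(q), apply_rope(k)   # attention scores q @ k.T then reflect relative positions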
Position Information in Transformers: An Overview
TLDR
An overview of common methods for incorporating position information into Transformer models is provided, showcasing that position information in Transformers is a vibrant and extensive research area and enabling the reader to compare existing methods through a unified notation and meaningful clustering.
A Survey of Transformers
TLDR
This survey provides a comprehensive review of various Transformer variants and proposes a new taxonomy of X-formers from three perspectives: architectural modification, pre-training, and applications.

References

Showing 1-10 of 27 references
Attention is All you Need
TLDR
A new simple network architecture, the Transformer, based solely on attention mechanisms and dispensing with recurrence and convolutions entirely, is proposed; it generalizes well to other tasks, as shown by applying it successfully to English constituency parsing with both large and limited training data.
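For reference, the core operation of the Transformer summarized above is scaled dot-product attention. The sketch below is a single-head, unmasked NumPy version; the function name and the omission of the learned projections are our simplifications.

import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Q, K: (n, d_k); V: (n, d_v) -> (n, d_v)."""
    scores = Q @ K.T / np.sqrt(Q.shape[-1])            # (n, n) similarity logits
    weights = np.exp(scores - scores.max(-1, keepdims=True))
    weights /= weights.sum(-1, keepdims=True)          # softmax over keys
    return weights @ V                                  # weighted sum of values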
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
TLDR
A new language representation model, BERT, designed to pre-train deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers, which can be fine-tuned with just one additional output layer to create state-of-the-art models for a wide range of tasks.
Self-Attention with Relative Position Representations
TLDR
This work presents an alternative approach, extending the self-attention mechanism to efficiently consider representations of the relative positions, or distances, between sequence elements, evaluated on the WMT 2014 English-to-German and English-to-French translation tasks.
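A hedged sketch of the mechanism summarized above: the attention logits receive an additional term from a learned embedding of the clipped relative distance j - i. The clipping window, variable names, and random initialization are illustrative assumptions, not the authors' exact configuration.

import numpy as np

rng = np.random.default_rng(0)
max_rel, d_k = 4, 16
rel_emb = rng.normal(size=(2 * max_rel + 1, d_k))      # one embedding per clipped offset, trainable in practice

def relative_attention_logits(Q, K):
    """Q, K: (n, d_k) -> (n, n) attention logits with a relative-position term."""
    n = Q.shape[0]
    rel = np.clip(np.arange(n)[None, :] - np.arange(n)[:, None], -max_rel, max_rel)
    a_k = rel_emb[rel + max_rel]                        # (n, n, d_k) relative-key embeddings
    logits = Q @ K.T + np.einsum('id,ijd->ij', Q, a_k)  # content term + relative term
    return logits / np.sqrt(d_k)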
Are Transformers universal approximators of sequence-to-sequence functions?
TLDR
It is established that Transformer models are universal approximators of continuous permutation-equivariant sequence-to-sequence functions with compact support, which is quite surprising given the amount of parameter sharing in these models.
HIBERT: Document Level Pre-training of Hierarchical Bidirectional Transformers for Document Summarization
TLDR
This work proposes HIBERT (shorthand for HIerarchical Bidirectional Encoder Representations from Transformers) for document encoding, along with a method to pre-train it using unlabeled data, and achieves state-of-the-art performance on the two evaluation datasets.
ALBERT: A Lite BERT for Self-supervised Learning of Language Representations
TLDR
This work presents two parameter-reduction techniques to lower memory consumption and increase the training speed of BERT, and uses a self-supervised loss that focuses on modeling inter-sentence coherence.
XLNet: Generalized Autoregressive Pretraining for Language Understanding
TLDR
XLNet is proposed, a generalized autoregressive pretraining method that enables learning bidirectional contexts by maximizing the expected likelihood over all permutations of the factorization order and overcomes the limitations of BERT thanks to its autoregressive formulation.
Neural Ordinary Differential Equations
TLDR
This work shows how to scalably backpropagate through any ODE solver, without access to its internal operations, which allows end-to-end training of ODEs within larger models.
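To make the setting concrete, the sketch below integrates a small parameterized vector field with fixed-step Euler updates. It is only a toy forward pass under our own assumptions about the network shape; the cited work's contribution, backpropagating through black-box adaptive solvers (e.g. with the adjoint method), is not implemented here.

import numpy as np

rng = np.random.default_rng(0)
d = 8
W = rng.normal(size=(d + 1, d)) * 0.1    # parameters of the vector field, trainable in practice

def f(h, t):
    """Vector field dh/dt = f(h, t), here a one-layer tanh network over [h; t]."""
    return np.tanh(np.concatenate([h, [t]]) @ W)

def odeint_euler(h0, t0=0.0, t1=1.0, steps=100):
    """Integrate h' = f(h, t) from t0 to t1 with fixed-step Euler updates."""
    h, t = h0.copy(), t0
    dt = (t1 - t0) / steps
    for _ in range(steps):
        h = h + dt * f(h, t)
        t += dt
    return h

print(odeint_euler(np.zeros(d)))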
Pointer Sentinel Mixture Models
TLDR
The pointer sentinel-LSTM model achieves state-of-the-art language modeling performance on the Penn Treebank while using far fewer parameters than a standard softmax LSTM, and the freely available WikiText corpus is introduced.
Language Models are Unsupervised Multitask Learners
TLDR
It is demonstrated that language models begin to learn these tasks without any explicit supervision when trained on a new dataset of millions of webpages called WebText, suggesting a promising path towards building language processing systems which learn to perform tasks from their naturally occurring demonstrations.