• Corpus ID: 215737171

Longformer: The Long-Document Transformer

Iz Beltagy, Matthew E. Peters, Arman Cohan
Transformer-based models are unable to process long sequences due to their self-attention operation, which scales quadratically with the sequence length. To address this limitation, we introduce the Longformer with an attention mechanism that scales linearly with sequence length, making it easy to process documents of thousands of tokens or longer. Longformer's attention mechanism is a drop-in replacement for the standard self-attention and combines a local windowed attention with a task… 
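The scaling argument in the abstract can be illustrated with a minimal sketch of local windowed attention. This is not the Longformer implementation: the function names `local_attention` and `attended_pairs` are hypothetical, the `window` parameter is assumed here to be a one-sided radius, and the task-specific global attention that the abstract goes on to mention is left out.

```python
import math

def local_attention(q, k, v, window):
    """Each position i attends only to positions j with |i - j| <= window.

    q, k, v are lists of n vectors (lists of floats). `window` is a
    one-sided radius, an assumption made for this sketch.
    """
    n, d = len(q), len(q[0])
    out = []
    for i in range(n):
        lo, hi = max(0, i - window), min(n, i + window + 1)
        # Scaled dot-product scores against the local neighbourhood only.
        scores = [sum(a * b for a, b in zip(q[i], k[j])) / math.sqrt(d)
                  for j in range(lo, hi)]
        m = max(scores)  # subtract the max for numerical stability
        weights = [math.exp(s - m) for s in scores]
        z = sum(weights)
        out.append([sum(w * v[j][c] for w, j in zip(weights, range(lo, hi))) / z
                    for c in range(len(v[0]))])
    return out

def attended_pairs(n, window):
    """Number of (query, key) pairs scored for a sequence of length n."""
    return sum(min(n, i + window + 1) - max(0, i - window) for i in range(n))
```

With a fixed `window`, `attended_pairs` grows roughly as n * (2 * window + 1), whereas full self-attention scores all n * n pairs.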

Memformer: The Memory-Augmented Transformer

Results show that Memformer outperforms previous long-range sequence models on WikiText-103, including Transformer-XL and Compressive Transformer, and is also compatible with other self-supervised tasks that further improve language modeling performance.

ERNIE-Doc: A Retrospective Long-Document Modeling Transformer

Two well-designed techniques, namely the retrospective feed mechanism and the enhanced recurrence mechanism, enable ERNIE-DOC, which has a much longer effective context length, to capture the contextual information of a complete document.

Linformer: Self-Attention with Linear Complexity

This paper demonstrates that the self-attention mechanism of the Transformer can be approximated by a low-rank matrix, and proposes a new self-attention mechanism, which reduces the overall self-attention complexity from $O(n^2)$ to $O(n)$ in both time and space.
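The low-rank idea can be sketched as follows, assuming that keys and values are projected along the sequence axis from n rows down to r rows before attention is computed. The names `matmul`, `softmax`, and `projected_attention` are hypothetical, and the projection matrices are passed in as plain arguments rather than learned, so this is an illustration of the shapes involved, not the Linformer code.

```python
import math

def matmul(a, b):
    """Plain list-of-lists matrix product."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax(row):
    m = max(row)  # subtract the max for numerical stability
    e = [math.exp(x - m) for x in row]
    z = sum(e)
    return [x / z for x in e]

def projected_attention(q, k, v, proj_k, proj_v):
    """Attention over r projected keys/values instead of n original ones.

    q, k, v: n x d matrices. proj_k, proj_v: r x n matrices (assumed given).
    Returns the n x r weight matrix and the n x d output.
    """
    d = len(q[0])
    k_small = matmul(proj_k, k)                       # r x d
    v_small = matmul(proj_v, v)                       # r x d
    k_small_t = [list(col) for col in zip(*k_small)]  # d x r
    scores = matmul(q, k_small_t)                     # n x r, not n x n
    weights = [softmax([s / math.sqrt(d) for s in row]) for row in scores]
    return weights, matmul(weights, v_small)
```

For a fixed r, the score matrix has n * r entries, so its size grows linearly in n.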

Random Feature Attention

RFA, a linear time and space attention that uses random feature methods to approximate the softmax function, is proposed and explored, showing that RFA is competitive in terms of both accuracy and efficiency on three long text classification datasets.
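A minimal sketch of the random feature idea, assuming trigonometric (random Fourier) features for a unit-bandwidth Gaussian kernel. The function names are hypothetical and this is not the RFA implementation; it only shows the kernel approximation. The link to softmax attention is the identity exp(q·k) = exp(|q|²/2) · exp(|k|²/2) · exp(-|q - k|²/2), whose last factor is the Gaussian kernel approximated below.

```python
import math
import random

def sample_directions(num_features, dim, seed=0):
    """Draw random projection directions w ~ N(0, I)."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(num_features)]

def random_features(x, ws):
    """Map x to [sin(w.x), cos(w.x)] over all directions, scaled by 1/sqrt(D)."""
    scale = 1.0 / math.sqrt(len(ws))
    proj = [sum(wi * xi for wi, xi in zip(w, x)) for w in ws]
    return [scale * math.sin(p) for p in proj] + [scale * math.cos(p) for p in proj]

def gaussian_kernel(x, y):
    """Exact kernel exp(-|x - y|^2 / 2) that the features approximate."""
    return math.exp(-0.5 * sum((a - b) ** 2 for a, b in zip(x, y)))

def approx_kernel(x, y, ws):
    """Inner product of feature maps; its expectation is gaussian_kernel(x, y)."""
    return sum(a * b for a, b in zip(random_features(x, ws), random_features(y, ws)))
```

Because the approximation is an inner product of feature vectors, the sum over keys can be computed once and reused for every query, which is where the linear cost comes from in this family of methods.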

Nyströmformer: A Nyström-Based Algorithm for Approximating Self-Attention

This work proposes Nyströmformer - a model that exhibits favorable scalability as a function of sequence length and performs favorably relative to other efficient self-attention methods.

Memory Transformer

This work proposes and studies two extensions of the Transformer baseline: adding memory tokens to store non-local representations, and creating a memory bottleneck for the global information. It evaluates these memory-augmented Transformers on a machine translation task and demonstrates that memory size positively correlates with model performance.

Synthesizer: Rethinking Self-Attention in Transformer Models

The true importance and contribution of the dot product-based self-attention mechanism on the performance of Transformer models is investigated and a model that learns synthetic attention weights without token-token interactions is proposed, called Synthesizer.

Hierarchical Learning for Generation with Long Source Sequences

A new Hierarchical Attention Transformer-based architecture (HAT) is presented that outperforms standard Transformers on several sequence-to-sequence tasks, and what the hierarchical layers learn is investigated by visualizing the hierarchical encoder-decoder attention.

Learning Hard Retrieval Cross Attention for Transformer

The hard retrieval attention mechanism can empirically accelerate scaled dot-product attention for both long and short sequences by 66.5%, while performing competitively on a wide range of machine translation tasks when used for cross-attention networks.

Cluster-Former: Clustering-based Sparse Transformer for Long-Range Dependency Encoding

Cluster-Former is proposed, a novel clustering-based sparse Transformer to perform attention across chunked sequences that allows information integration beyond local windows, which is especially beneficial for question answering (QA) and language modeling tasks that rely on long-range dependencies.



Generating Long Sequences with Sparse Transformers

This paper introduces sparse factorizations of the attention matrix which reduce the time and memory cost of attention from $O(n^2)$ to $O(n \sqrt{n})$, generates unconditional samples that demonstrate global coherence and great diversity, and shows it is possible in principle to use self-attention to model sequences of length one million or more.

Attention is All you Need

A new simple network architecture, the Transformer, based solely on attention mechanisms, dispensing with recurrence and convolutions entirely is proposed, which generalizes well to other tasks by applying it successfully to English constituency parsing both with large and limited training data.

BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension

BART is presented, a denoising autoencoder for pretraining sequence-to-sequence models, which matches the performance of RoBERTa on GLUE and SQuAD, and achieves new state-of-the-art results on a range of abstractive dialogue, question answering, and summarization tasks.

Transformer-XL: Attentive Language Models beyond a Fixed-Length Context

This work proposes a novel neural architecture Transformer-XL that enables learning dependency beyond a fixed length without disrupting temporal coherence, which consists of a segment-level recurrence mechanism and a novel positional encoding scheme.

Big Bird: Transformers for Longer Sequences

It is shown that BigBird is a universal approximator of sequence functions and is Turing complete, thereby preserving these properties of the quadratic, full attention model.

BP-Transformer: Modelling Long-Range Context via Binary Partitioning

Adopting a fine-to-coarse attention mechanism on multi-scale spans via binary partitioning (BP), BP-Transformer (BPT for short) is proposed, which performs better on long text than previous self-attention models.

Pay Less Attention with Lightweight and Dynamic Convolutions

It is shown that a very lightweight convolution can perform competitively with the best reported self-attention results, and dynamic convolutions are introduced which are simpler and more efficient than self-attention.

ETC: Encoding Long and Structured Inputs in Transformers

A new Transformer architecture, Extended Transformer Construction (ETC), is presented that addresses two key challenges of standard Transformer architectures, namely scaling input length and encoding structured inputs.

Sequence to Sequence Learning with Neural Networks

This paper presents a general end-to-end approach to sequence learning that makes minimal assumptions on the sequence structure, and finds that reversing the order of the words in all source sentences improved the LSTM's performance markedly, because doing so introduced many short term dependencies between the source and the target sentence which made the optimization problem easier.

Span Selection Pre-training for Question Answering

This paper introduces a new pre-training task inspired by reading comprehension to better align the pre-training from memorization to understanding, and shows strong empirical results for the proposed model: it obtains SOTA results on Natural Questions, a new benchmark MRC dataset, outperforming BERT-LARGE by 3 F1 points on short answer prediction.