ABC: Attention with Bounded-memory Control

Hao Peng, Jungo Kasai, Nikolaos Pappas, Dani Yogatama, Zhaofeng Wu, Lingpeng Kong, Roy Schwartz, Noah A. Smith
Transformer architectures have achieved state-of-the-art results on a variety of natural language processing (NLP) tasks. However, their attention mechanism comes with a quadratic complexity in sequence length, making the computational overhead prohibitive, especially for long sequences. Attention context can be seen as a random-access memory with each token taking a slot. Under this perspective, the memory size grows linearly with the sequence length, and so does the overhead of reading from…
Linear Complexity Randomized Self-attention Mechanism
A novel perspective for understanding the bias in random feature attention (RFA) approximation is proposed by recasting RFAs as self-normalized importance samplers; this perspective sheds light on an unbiased estimator for the whole softmax attention, called randomized attention (RA).
An Embarrassingly Simple Method to Mitigate Undesirable Properties of Pretrained Language Model Tokenizers
FLOTA (Few Longest Token Approximation) leads to performance gains, makes inference more efficient, and enhances the robustness of PLMs with respect to whitespace noise.
The NLP Task Effectiveness of Long-Range Transformers
It is found that attention of long-range transformers has advantages on content selection and query-guided decoding, but they come with previously unrecognized drawbacks such as insufficient attention to distant tokens.


Random Feature Attention
RFA, a linear time and space attention that uses random feature methods to approximate the softmax function, is proposed and explored, showing that RFA is competitive in terms of both accuracy and efficiency on three long text classification datasets.
Cluster-Former: Clustering-based Sparse Transformer for Long-Range Dependency Encoding
Cluster-Former is proposed, a novel clustering-based sparse Transformer to perform attention across chunked sequences that allows information integration beyond local windows, which is especially beneficial for question answering (QA) and language modeling tasks that rely on long-range dependencies.
Transformers are RNNs: Fast Autoregressive Transformers with Linear Attention
This work expresses self-attention as a linear dot-product of kernel feature maps and makes use of the associativity property of matrix products to reduce the complexity from $O(N^2)$ to $O(N)$, where N is the sequence length.
Linformer: Self-Attention with Linear Complexity
This paper demonstrates that the self-attention mechanism of the Transformer can be approximated by a low-rank matrix, and proposes a new self-attention mechanism that reduces the overall self-attention complexity from $O(n^2)$ to $O(n)$ in both time and space.
RoBERTa: A Robustly Optimized BERT Pretraining Approach
It is found that BERT was significantly undertrained, and can match or exceed the performance of every model published after it, and the best model achieves state-of-the-art results on GLUE, RACE and SQuAD.
Attention is All you Need
A new simple network architecture, the Transformer, is proposed that is based solely on attention mechanisms, dispensing with recurrence and convolutions entirely; it generalizes well to other tasks, as shown by its successful application to English constituency parsing with both large and limited training data.
Findings of the 2014 Workshop on Statistical Machine Translation
This paper presents the results of the WMT14 shared tasks, which included a standard news translation task, a separate medical translation task, a task for run-time estimation of machine translation