Sparsifying Transformer Models with Trainable Representation Pooling

Michał Pietruszka, Łukasz Borchmann, Łukasz Garncarek
We propose a novel method to sparsify attention in the Transformer model by learning to select the most informative token representations during training, thus focusing on the task-specific parts of an input. A reduction from quadratic to sublinear time and memory complexity was achieved thanks to a robust trainable top-k operator. Our experiments on a challenging long-document summarization task show that even our simple baseline performs comparably to the current SOTA, and with…
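The hard token selection that the paper's trainable top-k operator relaxes can be sketched as follows. This is an illustrative NumPy sketch, not the authors' implementation: the function name `topk_pool` and its arguments are assumptions, and the actual method uses a differentiable relaxation of this operation so the saliency scores can be learned end to end.

```python
import numpy as np

def topk_pool(H, scores, k):
    """Keep only the k highest-scoring token representations.

    H: (n, d) array of token representations.
    scores: (n,) learned saliency scores, one per token.
    Returns the (k, d) selected representations in their original order.
    Illustrative hard selection; the paper trains a soft relaxation of it.
    """
    idx = np.sort(np.argsort(scores)[-k:])  # top-k indices, original order
    return H[idx]
```

Pooling the sequence down to k tokens before the remaining layers is what turns the quadratic attention cost into a sublinear one with respect to the input length.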
Leveraging Locality in Abstractive Text Summarization
The experimental results show that the model can outperform strong baseline models with efficient attention modules, and the analysis provides further insight into the locality-aware modeling strategy.
Linear Complexity Randomized Self-attention Mechanism
A novel perspective is proposed for understanding the bias in random feature attention (RFA) approximations by recasting RFAs as self-normalized importance samplers, which sheds light on an unbiased estimator of the whole softmax attention, called randomized attention (RA).
Fast Abstractive Summarization with Reinforce-Selected Sentence Rewriting
An accurate and fast summarization model that first selects salient sentences and then rewrites them abstractively to generate a concise overall summary is proposed, which achieves the new state-of-the-art on all metrics on the CNN/Daily Mail dataset, as well as significantly higher abstractiveness scores.
A Discourse-Aware Attention Model for Abstractive Summarization of Long Documents
This work proposes the first model for abstractive summarization of single, longer-form documents (e.g., research papers), consisting of a new hierarchical encoder that models the discourse structure of a document, and an attentive discourse-aware decoder to generate the summary.
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
A new language representation model, BERT, designed to pre-train deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers, which can be fine-tuned with just one additional output layer to create state-of-the-art models for a wide range of tasks.
Linformer: Self-Attention with Linear Complexity
This paper demonstrates that the self-attention mechanism of the Transformer can be approximated by a low-rank matrix, and proposes a new self-attention mechanism which reduces the overall self-attention complexity from $O(n^2)$ to $O(n)$ in both time and space.
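The low-rank trick can be sketched directly: project the length dimension of the keys and values from n down to a fixed k, so the attention map is n × k rather than n × n. The sketch below is illustrative; in the actual model the projection matrices E and F are learned parameters, not inputs, and the function name is an assumption.

```python
import numpy as np

def linformer_attention(Q, K, V, E, F):
    """Low-rank self-attention sketch (Linformer-style, illustrative).

    Q, K, V: (n, d) query/key/value matrices.
    E, F: (k, n) projections of the length dimension (learned in the
    real model; passed in here for simplicity). The attention map has
    shape (n, k), giving O(n) cost for fixed k instead of O(n^2).
    """
    d = Q.shape[-1]
    K_proj = E @ K                              # (k, d)
    V_proj = F @ V                              # (k, d)
    scores = Q @ K_proj.T / np.sqrt(d)          # (n, k) instead of (n, n)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V_proj                     # (n, d)
```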
From Softmax to Sparsemax: A Sparse Model of Attention and Multi-Label Classification
Sparsemax, a new activation function similar to the traditional softmax but able to output sparse probabilities, is proposed, along with a corresponding loss function, and an unexpected connection between this new loss and the Huber classification loss is revealed.
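Sparsemax has a simple closed form: it is the Euclidean projection of the score vector onto the probability simplex, which thresholds scores at a data-dependent value and zeroes out everything below it. A minimal NumPy sketch of the forward pass, following the algorithm in Martins and Astudillo (2016):

```python
import numpy as np

def sparsemax(z):
    """Project z onto the probability simplex (sparsemax forward pass).

    Unlike softmax, the result can contain exact zeros for low-scoring
    entries while still summing to 1.
    """
    z = np.asarray(z, dtype=float)
    z_sorted = np.sort(z)[::-1]              # scores in descending order
    k = np.arange(1, len(z) + 1)
    cumsum = np.cumsum(z_sorted)
    support = k * z_sorted > cumsum - 1      # entries kept in the support
    k_max = k[support][-1]                   # size of the support
    tau = (cumsum[support][-1] - 1) / k_max  # threshold
    return np.maximum(z - tau, 0.0)
```

For example, `sparsemax([2.0, 0.0])` returns `[1.0, 0.0]`, assigning exactly zero probability to the weaker score, where softmax would still give it positive mass.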
A Unified Model for Extractive and Abstractive Summarization using Inconsistency Loss
By training the model end to end with the inconsistency loss together with the original losses of the extractive and abstractive models, it achieves state-of-the-art ROUGE scores while producing the most informative and readable summaries on the CNN/Daily Mail dataset according to a solid human evaluation.
On Extractive and Abstractive Neural Document Summarization with Transformer Language Models
A simple extractive step is performed before generation, and its output is used to condition the transformer language model on relevant information before it is tasked with generating the summary.
Generating Long Sequences with Sparse Transformers
This paper introduces sparse factorizations of the attention matrix which reduce its complexity to $O(n \sqrt{n})$, generates unconditional samples that demonstrate global coherence and great diversity, and shows it is possible in principle to use self-attention to model sequences of length one million or more.
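A factorized attention pattern of this kind can be sketched as a pair of masks: a local window plus a strided pattern, so each position attends to O(√n) others when the stride is about √n. This is an illustrative reconstruction of the strided pattern, not the paper's exact kernels, and the function name is an assumption.

```python
import numpy as np

def strided_sparse_mask(n, stride):
    """Causal strided sparse-attention mask (illustrative).

    Position i attends to the previous `stride` positions (local part)
    and to every stride-th earlier position (strided part). With
    stride ~ sqrt(n), each row has O(sqrt(n)) nonzeros instead of O(n).
    """
    mask = np.zeros((n, n), dtype=bool)
    for i in range(n):
        # local window: the previous `stride` positions, causal
        mask[i, max(0, i - stride + 1): i + 1] = True
        # strided: positions j <= i with (i - j) divisible by stride
        mask[i, np.arange(i % stride, i + 1, stride)] = True
    return mask
```

Composing the two patterns over successive layers lets information flow between any pair of positions, which is how the factorization preserves long-range modeling capacity.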
Bottom-Up Abstractive Summarization
This work explores the use of data-efficient content selectors to over-determine phrases in a source document that should be part of the summary, and shows that this approach improves the ability to compress text, while still generating fluent summaries.
Big Bird: Transformers for Longer Sequences
It is shown that BigBird is a universal approximator of sequence functions and is Turing complete, thereby preserving these properties of the quadratic, full attention model.