DeFormer: Decomposing Pre-trained Transformers for Faster Question Answering

Qingqing Cao, H. Trivedi, Aruna Balasubramanian, Niranjan Balasubramanian

Transformer-based QA models use input-wide self-attention – i.e. across both the question and the input passage – at all layers, causing them to be slow and memory-intensive. It turns out that we can get by without input-wide self-attention at all layers, especially in the lower layers. We introduce DeFormer, a decomposed transformer, which substitutes the full self-attention with question-wide and passage-wide self-attentions in the lower layers. This allows for question-independent processing… 
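
The decomposition described above can be pictured as a change in which tokens are allowed to attend to each other at each layer. The following is a minimal sketch, not the authors' implementation: the function name `attention_mask`, its parameters, and the choice of 9 lower layers in the usage lines are all hypothetical, chosen only for illustration. It assumes the input is laid out as question tokens followed by passage tokens.

```python
def attention_mask(q_len, p_len, layer, num_lower_layers):
    """Return mask[i][j] == True if token i may attend to token j.

    Assumption (hypothetical layout): tokens 0..q_len-1 are the question,
    tokens q_len..q_len+p_len-1 are the passage.
    """
    n = q_len + p_len
    if layer >= num_lower_layers:
        # Upper layers: input-wide self-attention, every token sees every token.
        return [[True] * n for _ in range(n)]
    # Lower layers: question-wide and passage-wide self-attention only,
    # i.e. a block-diagonal mask with no question-passage interaction.
    return [[(i < q_len) == (j < q_len) for j in range(n)] for i in range(n)]


def show(mask):
    """Render a mask as rows of 1/0 characters for quick inspection."""
    return ["".join("1" if cell else "0" for cell in row) for row in mask]


# Usage: 2 question tokens, 3 passage tokens, 9 lower layers (illustrative).
lower = show(attention_mask(2, 3, layer=0, num_lower_layers=9))
upper = show(attention_mask(2, 3, layer=9, num_lower_layers=9))
# lower == ["11000", "11000", "00111", "00111", "00111"]
# upper == ["11111"] * 5
```

Because the lower-layer mask never lets passage tokens attend to the question, the passage rows do not depend on the question, which is what makes question-independent processing of the passage possible in those layers.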

Block-Skim: Efficient Question Answering for Transformer

The key idea of Block-Skim is to identify, early during inference, the context that must be processed further and the context that can be safely discarded; the work finds that this information can be sufficiently derived from the self-attention weights inside the Transformer model.

Exploring Extreme Parameter Compression for Pre-trained Language Models

This work explores larger compression ratios for PLMs through tensor decomposition, a promising but under-investigated approach, and shows that the proposed method is orthogonal to existing compression methods like knowledge distillation.

Once is Enough: A Light-Weight Cross-Attention for Fast Sentence Pair Modeling

This paper introduces a novel paradigm, MixEncoder, which encodes the query only once while modeling the query-candidate interaction in parallel, and can speed up sentence pairing by over 113x while achieving performance comparable to the more expensive cross-attention models.

Learning Dense Representations of Phrases at Scale

This work shows for the first time that dense representations of phrases alone can be learned that achieve much stronger performance in open-domain QA, and proposes a query-side fine-tuning strategy, which can support transfer learning and reduce the discrepancy between training and inference.

Adding Recurrence to Pretrained Transformers for Improved Efficiency and Context Size

A novel method for applying pretrained transformer language models which lowers their memory requirement both at training and inference time, and removes the fixed context size constraint that most transformer models have, allowing for more flexible use.

Question Generation for Evaluating Cross-Dataset Shifts in Multi-modal Grounding

A VQG module, being developed at UCLA, is presented that facilitates automatically generating OOD shifts for systematically evaluating the cross-dataset adaptation capabilities of VQA models.

Efficient Relational Sentence Ordering Network

A novel deep Efficient Relational Sentence Ordering Network (referred to as ERSON) is proposed, which leverages pre-trained language models in both the encoder and decoder architectures to strengthen the coherence modeling of the entire model.

Modeling Context in Answer Sentence Selection Systems on a Latency Budget

The best approach, which leverages a multi-way attention architecture to efficiently encode context, improves 6% to 11% over non-contextual state of the art in AS2 with minimal impact on system latency.

Transkimmer: Transformer Learns to Layer-wise Skim

The Transkimmer architecture is proposed, which learns, at each layer, to make the skimming decision that identifies hidden-state tokens not required by that layer; it achieves a 10.97x average speedup on the GLUE benchmark compared with the vanilla BERT-base baseline, with less than 1% accuracy degradation.

Distilled Dual-Encoder Model for Vision-Language Understanding

DiDE is competitive with the fusion-encoder teacher model in performance (only a 1% drop) while enjoying 4× faster inference, and analyses reveal that the proposed cross-modal attention distillation is crucial to the success of the framework.



BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

A new language representation model, BERT, designed to pre-train deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers, which can be fine-tuned with just one additional output layer to create state-of-the-art models for a wide range of tasks.

QANet: Combining Local Convolution with Global Self-Attention for Reading Comprehension

A new Q&A architecture called QANet is proposed, which does not require recurrent networks, and its encoder consists exclusively of convolution and self-attention, where convolution models local interactions and self-attention models global interactions.

A Tensorized Transformer for Language Modeling

A novel self-attention model (namely Multi-linear attention) with Block-Term Tensor Decomposition (BTD) with tensor train decomposition is proposed, which can not only largely compress the model parameters but also obtain performance improvements.

TinyBERT: Distilling BERT for Natural Language Understanding

A novel Transformer distillation method, specially designed for knowledge distillation (KD) of Transformer-based models, is proposed; by leveraging this new KD method, the rich knowledge encoded in a large “teacher” BERT can be effectively transferred to a small “student” TinyBERT.

Bidirectional Attention Flow for Machine Comprehension

The BiDAF network is introduced, a multi-stage hierarchical process that represents the context at different levels of granularity and uses a bi-directional attention flow mechanism to obtain a query-aware context representation without early summarization.

Attention is All you Need

A new simple network architecture, the Transformer, is proposed, based solely on attention mechanisms and dispensing with recurrence and convolutions entirely; it generalizes well to other tasks, as shown by applying it successfully to English constituency parsing with both large and limited training data.

XLNet: Generalized Autoregressive Pretraining for Language Understanding

XLNet is proposed, a generalized autoregressive pretraining method that enables learning bidirectional contexts by maximizing the expected likelihood over all permutations of the factorization order and overcomes the limitations of BERT thanks to its autoregressive formulation.

MobileBERT: a Compact Task-Agnostic BERT for Resource-Limited Devices

MobileBERT is a thin version of BERT_LARGE, while equipped with bottleneck structures and a carefully designed balance between self-attentions and feed-forward networks that can be generically applied to various downstream NLP tasks via simple fine-tuning.

DeQA: On-Device Question Answering

DeQA is presented, a suite of latency and memory optimizations that adapt existing QA systems to run completely locally on mobile phones and provide at least a 13x speedup on average on the mobile phone across all three datasets.

GroupReduce: Block-Wise Low-Rank Approximation for Neural Language Model Shrinking

GroupReduce is proposed, a novel compression method for neural language models, based on vocabulary-partition based low-rank matrix approximation and the inherent frequency distribution of tokens (the power-law distribution of words).