DeFormer: Decomposing Pre-trained Transformers for Faster Question Answering

@inproceedings{Cao2020DeFormerDP,
  title={DeFormer: Decomposing Pre-trained Transformers for Faster Question Answering},
  author={Qingqing Cao and H. Trivedi and A. Balasubramanian and Niranjan Balasubramanian},
  booktitle={ACL},
  year={2020}
}
Transformer-based QA models use input-wide self-attention, i.e. across both the question and the input passage, at all layers, causing them to be slow and memory-intensive. It turns out that we can get by without input-wide self-attention at all layers, especially in the lower layers. We introduce DeFormer, a decomposed transformer, which substitutes the full self-attention with question-wide and passage-wide self-attentions in the lower layers. This allows for question-independent processing…
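
The decomposition can be made concrete in a few lines. Below is a minimal NumPy sketch, not the authors' implementation: the single-head attention, the 9/3 split between lower and upper layers, and the toy dimensions are all illustrative assumptions.

# Minimal sketch of DeFormer-style decomposition (illustrative, not the
# authors' code): lower layers attend within the question and within the
# passage separately; upper layers attend over the concatenated sequence.
import numpy as np

def softmax(scores, axis=-1):
    e = np.exp(scores - scores.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(x):
    # Single-head scaled dot-product self-attention over x of shape (seq, d).
    d = x.shape[-1]
    return softmax(x @ x.T / np.sqrt(d)) @ x

def encode_decomposed(question, passage, n_lower=9, n_upper=3):
    # Lower layers: question and passage are processed independently.
    for _ in range(n_lower):
        question = self_attention(question)
        passage = self_attention(passage)
    # Upper layers: full attention across the concatenated sequence.
    x = np.concatenate([question, passage], axis=0)
    for _ in range(n_upper):
        x = self_attention(x)
    return x

q = np.random.randn(8, 64)    # toy question: 8 tokens, hidden size 64
p = np.random.randn(128, 64)  # toy passage: 128 tokens
print(encode_decomposed(q, p).shape)  # (136, 64)

Because the lower-layer passage representations never depend on the question, they can be computed once per passage and cached, which is where the speed and memory savings come from.
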
ReadOnce Transformers: Reusable Representations of Text for Transformers
TLDR
This work presents a transformer-based approach, ReadOnce Transformers, that is trained to build information-capturing representations of text that can be re-used across different examples and tasks, so that a document needs to be read only once.
Adding Recurrence to Pretrained Transformers for Improved Efficiency and Context Size
TLDR
A novel method for applying pretrained transformer language models which lowers their memory requirement both at training and inference time, and removes the fixed context size constraint that most transformer models have, allowing for more flexible use.
Learning Dense Representations of Phrases at Scale
TLDR
This work shows for the first time that dense phrase representations alone can achieve much stronger performance in open-domain QA, and directly uses DensePhrases as a dense knowledge base for downstream tasks.
Modeling Context in Answer Sentence Selection Systems on a Latency Budget
TLDR
The best approach, which leverages a multi-way attention architecture to efficiently encode context, improves 6% to 11% over the non-contextual state of the art in AS2 with minimal impact on system latency.
Optimizing Inference Performance of Transformers on CPUs
TLDR
Focusing on the highly popular BERT model, this paper identifies key components of the Transformer architecture where the bulk of the computation happens, and proposes an Adaptive Linear Module Optimization (ALMO) to speed them up.
TR-BERT: Dynamic Token Reduction for Accelerating BERT Inference
TLDR
TR-BERT, a dynamic token reduction approach that accelerates PLM inference by flexibly adapting the number of layers each token passes through, avoiding redundant computation; this token-level layer-number adaptation greatly accelerates the self-attention operation in PLMs and achieves better performance with less computation on a suite of long-text tasks.
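
As a rough illustration of the token-reduction idea, the sketch below keeps only the highest-scoring tokens at selected depths so that deeper layers run over a shorter sequence. TR-BERT learns which tokens to keep and when; the stub encoder layer, the norm-based importance score, and the fixed keep ratio here are placeholder assumptions, not the paper's method.

# Hedged sketch of dynamic token reduction: prune tokens at selected depths so
# that deeper layers (and their self-attention) run over shorter sequences.
# The stub layer, norm-based score, and keep ratio are illustrative only.
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((64, 64)) / 8.0

def encoder_layer(x):
    # Stand-in for a real transformer layer (self-attention + FFN).
    return np.tanh(x @ W)

def encode_with_token_reduction(x, n_layers=12, reduce_at=(4, 8), keep_ratio=0.5):
    for depth in range(n_layers):
        x = encoder_layer(x)
        if depth in reduce_at:
            importance = np.linalg.norm(x, axis=-1)            # placeholder score
            k = max(1, int(round(len(x) * keep_ratio)))
            keep = np.sort(np.argsort(importance)[::-1][:k])   # top-k, original order
            x = x[keep]                                        # later layers see fewer tokens
    return x

tokens = rng.standard_normal((256, 64))           # toy long-text input: 256 tokens
print(encode_with_token_reduction(tokens).shape)  # (64, 64) after two reductions
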
NA-Aware Machine Reading Comprehension for Document-Level Relation Extraction
  • Zhenyu Zhang, Bowen Yu, Xiaobo Shu, Tingwen Liu
  • Computer Science
  • ECML/PKDD
  • 2021
Document-level relation extraction aims to identify semantic relations between target entities from the document. Most of the existing work roughly treats the document as a long sequence and produces…
Which *BERT? A Survey Organizing Contextualized Encoders
TLDR
A survey on language representation learning is presented, aiming to consolidate a series of shared lessons learned across a variety of recent efforts, and highlighting important considerations when interpreting recent contributions and choosing which model to use.
Adapting by Pruning: A Case Study on BERT
TLDR
This work proposes a novel model adaptation paradigm, adapting by pruning, which prunes neural connections in the pre-trained model to optimise the performance on the target task; all remaining connections have their weights intact.
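
Read literally, the summary says adaptation selects a binary mask over the pre-trained connections while leaving the surviving weights untouched. A minimal sketch of that idea follows; the magnitude-based mask criterion is a placeholder, since the paper optimises the mask for the target task.

# Minimal sketch of "adapting by pruning": the pre-trained weights are never
# updated; adaptation only chooses a binary mask over connections. The
# magnitude-based mask below is a placeholder for the task-driven criterion
# the paper actually optimises.
import numpy as np

rng = np.random.default_rng(0)
pretrained_w = rng.standard_normal((768, 768))     # a frozen pre-trained weight matrix

def prune_mask(w, keep_fraction=0.5):
    # Keep the largest-magnitude connections; zero out the rest.
    threshold = np.quantile(np.abs(w), 1.0 - keep_fraction)
    return (np.abs(w) >= threshold).astype(w.dtype)

mask = prune_mask(pretrained_w, keep_fraction=0.3)
adapted_w = pretrained_w * mask                    # surviving weights are intact
print(mask.mean())                                 # ~0.3 of connections survive
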
A Comprehensive Survey on Schema-based Event Extraction with Deep Learning
  • Qian Li, Hao Peng, +8 authors Philip S. Yu
  • Computer Science
  • 2021
Schema-based event extraction is a critical technique to apprehend the essential content of events promptly. With the rapid development of deep learning technology, event extraction technology based…

References

Showing 1-10 of 36 references
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
TLDR
A new language representation model, BERT, designed to pre-train deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers, which can be fine-tuned with just one additional output layer to create state-of-the-art models for a wide range of tasks.
QANet: Combining Local Convolution with Global Self-Attention for Reading Comprehension
TLDR
A new Q&A architecture called QANet is proposed, which does not require recurrent networks; its encoder consists exclusively of convolution and self-attention, where convolution models local interactions and self-attention models global interactions.
A Tensorized Transformer for Language Modeling
TLDR
A novel self-attention model (namely Multi-linear attention) with Block-Term Tensor Decomposition (BTD), combined with tensor train decomposition, is proposed, which can not only largely compress the model parameters but also obtain performance improvements.
Bidirectional Attention Flow for Machine Comprehension
TLDR
The BiDAF network is introduced, a multi-stage hierarchical process that represents the context at different levels of granularity and uses a bi-directional attention flow mechanism to obtain a query-aware context representation without early summarization.
Attention is All you Need
TLDR
A new simple network architecture, the Transformer, based solely on attention mechanisms, dispensing with recurrence and convolutions entirely, is proposed; it generalizes well to other tasks, as shown by applying it successfully to English constituency parsing with both large and limited training data.
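
For reference, the scaled dot-product attention at the core of the Transformer, and of the BERT-family models listed on this page, is (in the paper's notation):

\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V

where Q, K, and V are the query, key, and value matrices and d_k is the key dimension; multi-head attention runs several such attentions in parallel over learned projections and concatenates the results.
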
BoolQ: Exploring the Surprising Difficulty of Natural Yes/No Questions
TLDR
It is found that transferring from entailment data is more effective than transferring from paraphrase or extractive QA data, and that it, surprisingly, continues to be very beneficial even when starting from massive pre-trained language models such as BERT.
Are Sixteen Heads Really Better than One?
TLDR
The surprising observation is made that even if models have been trained using multiple heads, in practice a large percentage of attention heads can be removed at test time without significantly impacting performance.
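
The test-time head removal described above amounts to zeroing whole heads before the output projection. A hedged NumPy sketch follows; the random weights, head count, and the particular heads kept are illustrative, whereas the paper estimates head importance and removes low-importance heads.

# Hedged sketch of removing attention heads at test time: each head's output
# is multiplied by a 0/1 mask before the final projection. Weights and the
# choice of which heads to keep are illustrative, not the paper's procedure.
import numpy as np

def softmax(s, axis=-1):
    e = np.exp(s - s.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_self_attention(x, wq, wk, wv, wo, head_mask):
    n_heads, d_model, d_head = wq.shape
    heads = []
    for h in range(n_heads):
        q, k, v = x @ wq[h], x @ wk[h], x @ wv[h]
        attn = softmax(q @ k.T / np.sqrt(d_head)) @ v
        heads.append(head_mask[h] * attn)             # masked heads contribute nothing
    return np.concatenate(heads, axis=-1) @ wo        # (seq, d_model)

rng = np.random.default_rng(0)
seq, d_model, n_heads = 16, 64, 8
d_head = d_model // n_heads
wq, wk, wv = (rng.standard_normal((n_heads, d_model, d_head)) / 8 for _ in range(3))
wo = rng.standard_normal((n_heads * d_head, d_model)) / 8
x = rng.standard_normal((seq, d_model))

keep_half = np.array([1, 1, 1, 1, 0, 0, 0, 0])        # drop half the heads at test time
print(multi_head_self_attention(x, wq, wk, wv, wo, keep_half).shape)  # (16, 64)
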
XLNet: Generalized Autoregressive Pretraining for Language Understanding
TLDR
XLNet is proposed, a generalized autoregressive pretraining method that enables learning bidirectional contexts by maximizing the expected likelihood over all permutations of the factorization order, and overcomes the limitations of BERT thanks to its autoregressive formulation.
Distilling Task-Specific Knowledge from BERT into Simple Neural Networks
TLDR
This paper proposes to distill knowledge from BERT, a state-of-the-art language representation model, into a single-layer BiLSTM, as well as its siamese counterpart for sentence-pair tasks, and achieves comparable results with ELMo.
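
A minimal sketch of a task-specific distillation objective of this kind: the student (e.g. a small BiLSTM classifier) is trained on a mix of gold-label cross-entropy and a term matching the teacher's logits. The MSE logit-matching term and the mixing weight alpha are common choices stated here as assumptions, not necessarily the paper's exact formulation.

# Minimal sketch of a distillation objective: combine hard-label cross-entropy
# with a term that matches the teacher's (BERT's) logits. The MSE term and the
# mixing weight alpha are illustrative assumptions.
import numpy as np

def log_softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=axis, keepdims=True))

def distillation_loss(student_logits, teacher_logits, gold_labels, alpha=0.5):
    # Hard-label cross-entropy against the gold labels.
    logp = log_softmax(student_logits)
    ce = -logp[np.arange(len(gold_labels)), gold_labels].mean()
    # Soft supervision: match the teacher's logits.
    mse = ((student_logits - teacher_logits) ** 2).mean()
    return alpha * ce + (1.0 - alpha) * mse

rng = np.random.default_rng(0)
student = rng.standard_normal((32, 3))   # 32 examples, 3 classes
teacher = rng.standard_normal((32, 3))
gold = rng.integers(0, 3, size=32)
print(distillation_loss(student, teacher, gold))
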
MobileBERT: a Compact Task-Agnostic BERT for Resource-Limited Devices
TLDR
MobileBERT is a thin version of BERT_LARGE equipped with bottleneck structures and a carefully designed balance between self-attentions and feed-forward networks, and can be generically applied to various downstream NLP tasks via simple fine-tuning.