Sparse Distillation: Speeding Up Text Classification by Using Bigger Student Models
@inproceedings{Ye2021SparseDS,
  title={Sparse Distillation: Speeding Up Text Classification by Using Bigger Student Models},
  author={Qinyuan Ye and Madian Khabsa and Mike Lewis and Sinong Wang and Xiang Ren and Aaron Jaech},
  booktitle={North American Chapter of the Association for Computational Linguistics},
  year={2021}
}
Distilling state-of-the-art transformer models into lightweight student models is an effective way to reduce computation cost at inference time. The student models are typically compact transformers with fewer parameters, while expensive operations such as self-attention persist. Therefore, the improved inference speed may still be unsatisfactory for real-time or high-volume use cases. In this paper, we aim to further push the limit of inference speed by distilling teacher models into bigger…
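As background for the distillation setup described above, here is a minimal sketch of a soft-label distillation objective for classification, in which the student matches the teacher's temperature-softened predictions. The function name, temperature value, and tensor shapes are illustrative assumptions, not details from the paper; the truncated abstract does not specify the student architecture.

```python
import torch.nn.functional as F

def soft_label_distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """Hypothetical distillation objective: KL divergence between the
    temperature-softened teacher and student class distributions.
    Assumed shapes: (batch, num_classes)."""
    soft_targets = F.softmax(teacher_logits / temperature, dim=-1)
    log_student = F.log_softmax(student_logits / temperature, dim=-1)
    # Scale by T^2 so gradient magnitudes stay comparable across temperatures.
    return F.kl_div(log_student, soft_targets, reduction="batchmean") * temperature ** 2
```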
One Citation
Research Proposal: Evaluating and Enabling Human–AI Collaboration
- Computer Science
- 2022
The goal is to create metrics that measure whether AI methods make sense to users, to help users craft examples that advance AI, and to apply AI to applications that illuminate complex social science problems.
References
SHOWING 1-10 OF 50 REFERENCES
Are Pretrained Convolutions Better than Pretrained Transformers?
- Computer Science, ACL
- 2021
CNN-based pre-trained models are found to be competitive and to outperform their Transformer counterparts in certain scenarios, albeit with caveats; the results suggest that conflating pre-training and architectural advances is misguided and that the two should be considered independently.
Hard-Coded Gaussian Attention for Neural Machine Translation
- Computer Science, ACL
- 2020
A “hard-coded” attention variant with no learned parameters is developed; it offers insight into which components of the Transformer are actually important and is intended to guide future work toward simpler and more efficient attention-based models.
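To illustrate the idea of attention weights that are fixed rather than learned, here is a minimal sketch in which each position mixes token representations with weights drawn from a Gaussian over position offsets. The function, the choice of centering on the query position itself, and the standard deviation are simplifying assumptions, not the paper's exact head configuration.

```python
import torch

def hard_coded_gaussian_attention(values: torch.Tensor, std: float = 1.0) -> torch.Tensor:
    """Mix token representations with fixed Gaussian weights over position offsets.

    values: (seq_len, d_model). No parameters are learned anywhere in this function.
    """
    seq_len = values.size(0)
    positions = torch.arange(seq_len, dtype=torch.float32)
    offsets = positions.unsqueeze(0) - positions.unsqueeze(1)   # (seq_len, seq_len)
    scores = -offsets.pow(2) / (2 * std ** 2)                   # log of an unnormalized Gaussian
    weights = torch.softmax(scores, dim=-1)                     # each row sums to 1
    return weights @ values
```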
MobileBERT: a Compact Task-Agnostic BERT for Resource-Limited Devices
- Computer Science, ACL
- 2020
MobileBERT is a thin version of BERT_LARGE equipped with bottleneck structures and a carefully designed balance between self-attention and feed-forward networks; it can be generically applied to various downstream NLP tasks via simple fine-tuning.
DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter
- Computer Science, ArXiv
- 2019
This work proposes a method to pre-train a smaller general-purpose language representation model, called DistilBERT, which can be fine-tuned with good performance on a wide range of tasks like its larger counterparts, and introduces a triple loss combining language modeling, distillation, and cosine-distance losses.
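A minimal sketch of how such a triple objective can be assembled, assuming batched teacher/student logits and hidden states of matching width; the unweighted sum, the temperature, and the tensor shapes are assumptions for illustration rather than the paper's exact hyperparameters.

```python
import torch.nn.functional as F

def triple_loss(student_logits, teacher_logits, mlm_labels,
                student_hidden, teacher_hidden, temperature=2.0):
    """Illustrative combination of masked-LM, distillation, and cosine-distance losses.

    Assumed shapes: logits (batch, seq, vocab), hidden states (batch, seq, dim),
    mlm_labels (batch, seq) with -100 marking positions that are not predicted.
    """
    # 1) Supervised masked-language-modeling loss on the student's own predictions.
    mlm = F.cross_entropy(student_logits.flatten(0, 1), mlm_labels.flatten(), ignore_index=-100)
    # 2) Distillation loss against the teacher's temperature-softened distribution.
    kd = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * temperature ** 2
    # 3) Cosine-distance loss aligning student and teacher hidden states.
    cos = 1.0 - F.cosine_similarity(student_hidden, teacher_hidden, dim=-1).mean()
    return mlm + kd + cos
```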
Distilling Task-Specific Knowledge from BERT into Simple Neural Networks
- Computer Science, ArXiv
- 2019
This paper proposes to distill knowledge from BERT, a state-of-the-art language representation model, into a single-layer BiLSTM, as well as into its siamese counterpart for sentence-pair tasks, achieving results comparable to ELMo's.
Deep Unordered Composition Rivals Syntactic Methods for Text Classification
- Computer Science, ACL
- 2015
This work presents a simple deep neural network that competes with, and in some cases outperforms, syntax-based models on sentiment analysis and factoid question answering tasks while taking only a fraction of the training time.
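A minimal sketch of a deep unordered (averaging) model of this kind: word embeddings are averaged without regard to order and passed through a small feed-forward network. The class name, dimensions, and depth are illustrative choices, not the paper's configuration.

```python
import torch
import torch.nn as nn

class DeepAveragingNetwork(nn.Module):
    """Average word embeddings, then classify with a small MLP (no attention, no recursion)."""

    def __init__(self, vocab_size=30000, embed_dim=300, hidden_dim=300, num_classes=2):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.mlp = nn.Sequential(
            nn.Linear(embed_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, num_classes),
        )

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        # token_ids: (batch, seq_len); index 0 is padding.
        mask = (token_ids != 0).unsqueeze(-1).float()
        embedded = self.embedding(token_ids) * mask
        # Unordered composition: a plain average over non-padding positions.
        averaged = embedded.sum(dim=1) / mask.sum(dim=1).clamp(min=1.0)
        return self.mlp(averaged)
```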
Recursive Deep Models for Semantic Compositionality Over a Sentiment Treebank
- Computer Science, EMNLP
- 2013
A Sentiment Treebank with fine-grained sentiment labels for 215,154 phrases in the parse trees of 11,855 sentences is presented; it poses new challenges for sentiment compositionality, and the Recursive Neural Tensor Network is introduced.
RoBERTa: A Robustly Optimized BERT Pretraining Approach
- Computer Science, ArXiv
- 2019
It is found that BERT was significantly undertrained and, when trained more carefully, can match or exceed the performance of every model published after it; the best model achieves state-of-the-art results on GLUE, RACE, and SQuAD.
PAQ: 65 Million Probably-Asked Questions and What You Can Do With Them
- Computer Science, Transactions of the Association for Computational Linguistics
- 2021
A new QA-pair retriever, RePAQ, is introduced to complement PAQ, and it is found that PAQ preempts and caches test questions, enabling RePAQ to match the accuracy of recent retrieve-and-read models while being significantly faster.
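To make the QA-pair-retrieval idea concrete, here is a minimal sketch that answers a question by returning the answer attached to the most similar cached question. The function, the use of plain cosine similarity over precomputed embeddings, and the variable names are assumptions for illustration, not the actual RePAQ retriever.

```python
import numpy as np

def answer_from_qa_cache(query_vec, cached_question_vecs, cached_answers):
    """Return the cached answer whose question embedding is most similar to the query.

    query_vec: (dim,); cached_question_vecs: (num_pairs, dim); cached_answers: list of strings.
    """
    q = query_vec / np.linalg.norm(query_vec)
    c = cached_question_vecs / np.linalg.norm(cached_question_vecs, axis=1, keepdims=True)
    best = int(np.argmax(c @ q))   # cosine similarity via dot product of unit vectors
    return cached_answers[best]
```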
Interpreting Pretrained Contextualized Representations via Reductions to Static Embeddings
- Computer Science, ACL
- 2020
Simple and fully general methods for converting contextualized representations into static lookup-table embeddings are introduced and applied to 5 popular pretrained models and 9 sets of pretrained weights; the results reveal that pooling over many contexts significantly improves representational quality under intrinsic evaluation.
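A minimal sketch of the pooling idea, assuming a hypothetical `encode_fn(sentence)` helper that wraps some contextual encoder and yields (token, vector) pairs; the helper, the mean-pooling choice, and the per-word context cap are illustrative assumptions, not the paper's exact procedure.

```python
from collections import defaultdict

def distill_static_embeddings(encode_fn, corpus_sentences, max_contexts=1000):
    """Build one static vector per word type by mean-pooling its contextual vectors.

    encode_fn(sentence) -> iterable of (token, vector) pairs is an assumed helper
    wrapping any contextual encoder; vectors may be numpy arrays or torch tensors.
    """
    sums, counts = {}, defaultdict(int)
    for sentence in corpus_sentences:
        for token, vector in encode_fn(sentence):
            if counts[token] >= max_contexts:
                continue  # cap the number of contexts pooled per word type
            sums[token] = vector if token not in sums else sums[token] + vector
            counts[token] += 1
    # Average the accumulated contextual vectors to obtain the static embedding table.
    return {token: total / counts[token] for token, total in sums.items()}
```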