Sparse Distillation: Speeding Up Text Classification by Using Bigger Student Models

@inproceedings{Ye2022SparseDS,
  title={Sparse Distillation: Speeding Up Text Classification by Using Bigger Student Models},
  author={Qinyuan Ye and Madian Khabsa and Mike Lewis and Sinong Wang and Xiang Ren and Aaron Jaech},
  booktitle={NAACL},
  year={2022}
}
Distilling state-of-the-art transformer models into lightweight student models is an effective way to reduce computation cost at inference time. The student models are typically compact transformers with fewer parameters, while expensive operations such as self-attention persist. Therefore, the improved inference speed may still be unsatisfactory for real-time or high-volume use cases. In this paper, we aim to further push the limit of inference speed by distilling teacher models into bigger… 
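
The distillation setup described here follows the standard teacher-student recipe. As a rough illustration, the sketch below shows a generic soft-target distillation loss in PyTorch; the function, hyperparameter names, and default values are assumptions for illustration, not the paper's implementation.

```python
# Minimal sketch of knowledge distillation for classification, assuming a
# PyTorch setup; `teacher` and `student` are hypothetical models that both
# return logits of shape (batch, num_classes).
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels,
                      temperature=2.0, alpha=0.5):
    """Blend standard cross-entropy with a KL term that pushes the
    student's softened distribution toward the teacher's."""
    # Soft targets: KL divergence between temperature-scaled distributions.
    soft_loss = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * temperature ** 2
    # Hard targets: cross-entropy on the gold labels.
    hard_loss = F.cross_entropy(student_logits, labels)
    return alpha * soft_loss + (1.0 - alpha) * hard_loss
```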

References


Are Pretrained Convolutions Better than Pretrained Transformers?

CNN-based pre-trained models are competitive with, and in certain scenarios outperform, their Transformer counterparts, albeit with caveats; the results suggest that conflating pre-training and architectural advances is misguided and that the two should be considered independently.

Hard-Coded Gaussian Attention for Neural Machine Translation

A “hard-coded” attention variant without any learned parameters is developed; it offers insight into which components of the Transformer are actually important and, it is hoped, will guide future work toward simpler and more efficient attention-based models.
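
For intuition, a hard-coded attention pattern can be as simple as a fixed Gaussian over token positions. The sketch below is an illustrative PyTorch rendering under that assumption; the single-head formulation, offset, and standard deviation are not taken from the paper.

```python
# Minimal sketch of "hard-coded" Gaussian attention: weights are a fixed
# Gaussian over positions, centered near each query position, with no
# learned parameters. All values here are illustrative assumptions.
import torch

def hard_coded_gaussian_attention(values, center_offset=0, std=1.0):
    # values: (seq_len, d_model); weights depend only on position, not content.
    seq_len = values.size(0)
    positions = torch.arange(seq_len, dtype=torch.float)
    centers = positions + center_offset                    # each query attends around itself
    dist = positions.unsqueeze(0) - centers.unsqueeze(1)   # (query, key) position offsets
    scores = -dist.pow(2) / (2 * std ** 2)                 # log of an unnormalized Gaussian
    weights = torch.softmax(scores, dim=-1)                # rows sum to 1
    return weights @ values                                # position-weighted average of values
```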

MobileBERT: a Compact Task-Agnostic BERT for Resource-Limited Devices

MobileBERT is a thin version of BERT_LARGE equipped with bottleneck structures and a carefully designed balance between self-attention and feed-forward networks; it can be generically applied to various downstream NLP tasks via simple fine-tuning.

DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter

This work proposes a method to pre-train a smaller general-purpose language representation model, called DistilBERT, which can be fine-tuned with good performance on a wide range of tasks like its larger counterparts, and introduces a triple loss combining language modeling, distillation, and cosine-distance losses.
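
As a rough sketch of how such a triple loss could be combined, assuming both models expose vocabulary logits and last-layer hidden states of the same dimension (the names and the equal weighting are illustrative assumptions, not DistilBERT's actual training code):

```python
# Minimal sketch of a DistilBERT-style triple loss: soft-target distillation,
# masked language modeling, and a cosine-distance term on hidden states.
import torch
import torch.nn.functional as F

def triple_loss(student_logits, teacher_logits, student_hidden, teacher_hidden,
                mlm_labels, temperature=2.0):
    # (1) Distillation: KL between temperature-softened distributions.
    distill = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * temperature ** 2
    # (2) Masked language modeling loss on the student (-100 marks unmasked tokens).
    mlm = F.cross_entropy(
        student_logits.view(-1, student_logits.size(-1)),
        mlm_labels.view(-1),
        ignore_index=-100,
    )
    # (3) Cosine-distance loss aligning student and teacher hidden states.
    flat_student = student_hidden.view(-1, student_hidden.size(-1))
    flat_teacher = teacher_hidden.view(-1, teacher_hidden.size(-1))
    target = torch.ones(flat_student.size(0), device=flat_student.device)
    cosine = F.cosine_embedding_loss(flat_student, flat_teacher, target)
    return distill + mlm + cosine   # equal weights here; in practice they are tuned
```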

Distilling Task-Specific Knowledge from BERT into Simple Neural Networks

This paper proposes to distill knowledge from BERT, a state-of-the-art language representation model, into a single-layer BiLSTM, as well as its siamese counterpart for sentence-pair tasks, achieving results comparable to ELMo.

Deep Unordered Composition Rivals Syntactic Methods for Text Classification

This work presents a simple deep neural network that competes with and, in some cases, outperforms such models on sentiment analysis and factoid question answering tasks while taking only a fraction of the training time.
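
The “deep unordered composition” referred to here is a deep averaging network: word embeddings are averaged without regard to order and passed through a few feed-forward layers. A minimal PyTorch sketch follows; the dimensions and layer counts are illustrative assumptions.

```python
# Minimal sketch of a deep averaging network (DAN) for text classification.
import torch
import torch.nn as nn

class DeepAveragingNetwork(nn.Module):
    def __init__(self, vocab_size, embed_dim=300, hidden_dim=300, num_classes=2):
        super().__init__()
        # EmbeddingBag averages embeddings per example, ignoring word order.
        self.embedding = nn.EmbeddingBag(vocab_size, embed_dim, mode="mean")
        self.feedforward = nn.Sequential(
            nn.Linear(embed_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, num_classes),
        )

    def forward(self, token_ids, offsets):
        # token_ids: flat 1-D tensor of token indices; offsets mark example starts.
        averaged = self.embedding(token_ids, offsets)
        return self.feedforward(averaged)
```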

Recursive Deep Models for Semantic Compositionality Over a Sentiment Treebank

A Sentiment Treebank with fine-grained sentiment labels for 215,154 phrases in the parse trees of 11,855 sentences is introduced, presenting new challenges for sentiment compositionality, along with the Recursive Neural Tensor Network to address them.

RoBERTa: A Robustly Optimized BERT Pretraining Approach

It is found that BERT was significantly undertrained and that, with better pretraining choices, it can match or exceed the performance of every model published after it; the best model achieves state-of-the-art results on GLUE, RACE, and SQuAD.

PAQ: 65 Million Probably-Asked Questions and What You Can Do With Them

It is found that PAQ preempts and caches test questions, enabling RePAQ, a new QA-pair retriever introduced to complement PAQ, to match the accuracy of recent retrieve-and-read models while being significantly faster.

Interpreting Pretrained Contextualized Representations via Reductions to Static Embeddings

Simple and fully general methods for converting contextualized representations into static lookup-table embeddings are introduced and applied to 5 popular pretrained models and 9 sets of pretrained weights, revealing that pooling over many contexts significantly improves representational quality under intrinsic evaluation.
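
A rough sketch of the pooling idea is given below, assuming the Hugging Face transformers API and handling only words that map to a single vocabulary token; the model choice and function name are illustrative assumptions, not the paper's exact procedure.

```python
# Minimal sketch: derive a static embedding for a word by averaging its
# contextualized vectors over many sentences that contain it.
import torch
from transformers import AutoModel, AutoTokenizer

def static_embedding(word, contexts, model_name="bert-base-uncased"):
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModel.from_pretrained(model_name)
    model.eval()
    pooled = []
    word_id = tokenizer.convert_tokens_to_ids(word)  # assumes `word` is a single token
    for sent in contexts:                            # sentences containing `word`
        enc = tokenizer(sent, return_tensors="pt")
        with torch.no_grad():
            hidden = model(**enc).last_hidden_state[0]   # (seq_len, dim)
        mask = enc["input_ids"][0] == word_id            # occurrences of `word`
        if mask.any():
            pooled.append(hidden[mask].mean(dim=0))      # pool within this context
    # Mean-pool across contexts to obtain one static vector for the word.
    return torch.stack(pooled).mean(dim=0)
```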