DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter
@article{Sanh2019DistilBERTAD,
  title   = {DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter},
  author  = {Victor Sanh and Lysandre Debut and Julien Chaumond and Thomas Wolf},
  journal = {ArXiv},
  year    = {2019},
  volume  = {abs/1910.01108}
}
As Transfer Learning from large-scale pre-trained models becomes more prevalent in Natural Language Processing (NLP), operating these large models on the edge and/or under constrained computational training or inference budgets remains challenging. In this work, we propose a method to pre-train a smaller general-purpose language representation model, called DistilBERT, which can then be fine-tuned with good performance on a wide range of tasks like its larger counterparts. While most prior…
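The paper's pre-training objective combines a soft-target distillation loss with the usual masked-language-modeling loss and a cosine loss aligning student and teacher hidden states. Below is a minimal PyTorch sketch of such an objective; the tensor names and loss weights are illustrative placeholders, not the authors' exact implementation.

```python
# Minimal sketch of a DistilBERT-style objective: soft-target distillation +
# masked-language-modeling loss + cosine alignment of hidden states.
# All names and weights are illustrative assumptions.
import torch
import torch.nn.functional as F

def distillation_objective(student_logits, teacher_logits,
                           student_hidden, teacher_hidden,
                           mlm_labels, temperature=2.0,
                           alpha_ce=5.0, alpha_mlm=2.0, alpha_cos=1.0):
    # KL between temperature-softened student and teacher distributions.
    soft_student = F.log_softmax(student_logits / temperature, dim=-1)
    soft_teacher = F.softmax(teacher_logits / temperature, dim=-1)
    loss_ce = F.kl_div(soft_student, soft_teacher,
                       reduction="batchmean") * temperature ** 2

    # Standard MLM loss on the hard labels (-100 marks unmasked positions).
    loss_mlm = F.cross_entropy(student_logits.view(-1, student_logits.size(-1)),
                               mlm_labels.view(-1), ignore_index=-100)

    # Cosine loss pulling student hidden states toward the teacher's.
    target = student_hidden.new_ones(student_hidden.size(0) * student_hidden.size(1))
    loss_cos = F.cosine_embedding_loss(
        student_hidden.view(-1, student_hidden.size(-1)),
        teacher_hidden.view(-1, teacher_hidden.size(-1)),
        target)

    return alpha_ce * loss_ce + alpha_mlm * loss_mlm + alpha_cos * loss_cos
```

Softening both distributions with the same temperature and rescaling by temperature² keeps the gradient magnitude of the soft-target term comparable to the hard-label term.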
3,027 Citations
TinyBERT: Distilling BERT for Natural Language Understanding
- Computer Science · Findings of EMNLP
- 2020
A novel Transformer distillation method designed specifically for knowledge distillation (KD) of Transformer-based models is proposed; by leveraging this new KD method, the rich knowledge encoded in a large “teacher” BERT can be effectively transferred to a small “student” TinyBERT.
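The entry above describes layer-to-layer Transformer distillation. A hedged sketch of the general idea, matching attention matrices and (projected) hidden states with MSE, follows; class and tensor names are illustrative, and the full method also distills the embedding and prediction layers.

```python
# Hedged sketch of Transformer layer-to-layer distillation: the student mimics
# the teacher's attention matrices and hidden states, with a learned projection
# bridging the dimension mismatch. Names and shapes are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class LayerDistillLoss(nn.Module):
    def __init__(self, student_dim: int, teacher_dim: int):
        super().__init__()
        # Maps student hidden states into the teacher's hidden dimension.
        self.proj = nn.Linear(student_dim, teacher_dim)

    def forward(self, student_attn, teacher_attn, student_hidden, teacher_hidden):
        # MSE between (batch, heads, seq, seq) attention score matrices.
        attn_loss = F.mse_loss(student_attn, teacher_attn)
        # MSE between projected student hidden states and teacher hidden states.
        hidden_loss = F.mse_loss(self.proj(student_hidden), teacher_hidden)
        return attn_loss + hidden_loss
```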
RobBERTje: a Distilled Dutch BERT Model
- Computer Science · ArXiv
- 2022
This paper creates several distilled versions of the state-of-the-art Dutch RobBERT model and finds that the larger DistilBERT architecture works significantly better than the Bort hyperparametrization; it also finds that the distilled models exhibit less gender-stereotypical bias than their teacher model.
MoEBERT: from BERT to Mixture-of-Experts via Importance-Guided Adaptation
- Computer Science · NAACL
- 2022
This work proposes MoEBERT, which uses a Mixture-of-Experts structure to increase model capacity and inference speed, together with a layer-wise distillation method to train it; MoEBERT outperforms existing task-specific distillation algorithms.
MixKD: Towards Efficient Distillation of Large-scale Language Models
- Computer Science · ICLR
- 2021
MixKD, a data-agnostic distillation framework, is proposed; it leverages mixup, a simple yet efficient data augmentation approach, to endow the resulting model with stronger generalization ability, and it is proved from a theoretical perspective that, under reasonable conditions, MixKD yields a smaller gap between the generalization error and the empirical error.
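A hedged sketch of the mixup-based distillation idea described above: interpolate pairs of examples (here at the embedding level, a common choice for text), then have the student match the teacher's predictions on the mixed inputs. All names and hyperparameter values are illustrative assumptions, not the paper's exact recipe.

```python
# Illustrative mixup-based distillation step: mix two examples, distill the
# teacher's softened predictions on the mixed input, and supervise with the
# interpolated hard labels. student/teacher are placeholder callables.
import torch
import torch.nn.functional as F

def mixkd_step(student, teacher, embeddings, labels, alpha=0.4, temperature=2.0):
    lam = torch.distributions.Beta(alpha, alpha).sample().item()
    perm = torch.randperm(embeddings.size(0))

    # Convex combination of two examples.
    mixed = lam * embeddings + (1.0 - lam) * embeddings[perm]

    with torch.no_grad():
        teacher_logits = teacher(mixed)
    student_logits = student(mixed)

    # Distillation term: match the teacher's softened predictions on mixed data.
    kd = F.kl_div(F.log_softmax(student_logits / temperature, dim=-1),
                  F.softmax(teacher_logits / temperature, dim=-1),
                  reduction="batchmean") * temperature ** 2

    # Supervised term on the interpolated hard labels.
    ce = lam * F.cross_entropy(student_logits, labels) \
         + (1.0 - lam) * F.cross_entropy(student_logits, labels[perm])
    return kd + ce
```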
LRC-BERT: Latent-representation Contrastive Knowledge Distillation for Natural Language Understanding
- Computer Science · AAAI
- 2021
This work proposes LRC-BERT, a knowledge distillation method based on contrastive learning that fits the outputs of the intermediate layers in terms of angular distance, an aspect not considered by existing distillation methods.
Understanding BERT Rankers Under Distillation
- Computer Science · ICTIR
- 2020
Studies whether and how the knowledge for search within BERT can be transferred to a smaller ranker through distillation, yielding up to a nine-fold speedup while preserving state-of-the-art performance.
Towards Effective Utilization of Pre-trained Language Models
- Computer Science
- 2020
This thesis proposes MKD, a multi-task knowledge distillation approach in which a large pre-trained model serves as teacher and transfers its knowledge to a small student model; the student is distilled on different tasks jointly, so that it learns a more universal language representation by leveraging cross-task data.
RefBERT: Compressing BERT by Referencing to Pre-computed Representations
- Computer Science · 2021 International Joint Conference on Neural Networks (IJCNN)
- 2021
RefBERT is proposed to leverage the knowledge learned from the teacher, i.e., exploiting the pre-computed BERT representation of the reference sample, to compress BERT into a smaller student model that is 7.4x smaller and 9.5x faster at inference than BERT-base.
Poor Man's BERT: Smaller and Faster Transformer Models
- Computer Science · ArXiv
- 2020
A number of memory-light model reduction strategies that do not require model pre-training from scratch are explored, which are able to prune BERT, RoBERTa and XLNet models by up to 40%, while maintaining up to 98% of their original performance.
Improving Generalization of Pre-trained Language Models via Stochastic Weight Averaging
- Computer Science · ArXiv
- 2022
This work adapts Stochastic Weight Averaging (SWA), a method encouraging convergence to a flatter minimum, to pre-trained language models and demonstrates that this simple optimization technique can outperform state-of-the-art KD methods for compact models.
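A minimal sketch of applying SWA during fine-tuning, using PyTorch's built-in swa_utils; the model, data loader, learning rates, and the epoch at which averaging starts are placeholder assumptions rather than the paper's settings.

```python
# Sketch of Stochastic Weight Averaging during fine-tuning: after a warm-up
# phase, keep a running average of the weights visited by the optimizer.
import torch
from torch.optim.swa_utils import AveragedModel, SWALR

def finetune_with_swa(model, loader, loss_fn, epochs=10, swa_start=5):
    optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
    swa_model = AveragedModel(model)            # running average of weights
    swa_scheduler = SWALR(optimizer, swa_lr=1e-5)

    for epoch in range(epochs):
        for batch, labels in loader:
            optimizer.zero_grad()
            loss = loss_fn(model(batch), labels)
            loss.backward()
            optimizer.step()
        if epoch >= swa_start:
            swa_model.update_parameters(model)  # fold current weights into the average
            swa_scheduler.step()

    # Recompute BatchNorm statistics for the averaged weights (a no-op for
    # pure LayerNorm Transformers, kept here for completeness).
    torch.optim.swa_utils.update_bn(loader, swa_model)
    return swa_model
```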
References
Showing 1-10 of 21 references
Distilling Task-Specific Knowledge from BERT into Simple Neural Networks
- Computer Science · ArXiv
- 2019
This paper proposes to distill knowledge from BERT, a state-of-the-art language representation model, into a single-layer BiLSTM, as well as its siamese counterpart for sentence-pair tasks, achieving results comparable to ELMo.
Well-Read Students Learn Better: The Impact of Student Initialization on Knowledge Distillation
- Computer Science · ArXiv
- 2019
It is observed that applying language model pre-training to students unlocks their generalization potential, surprisingly even for very compact networks.
Language Models are Unsupervised Multitask Learners
- Computer Science
- 2019
It is demonstrated that language models begin to learn these tasks without any explicit supervision when trained on a new dataset of millions of webpages called WebText, suggesting a promising path towards building language processing systems which learn to perform tasks from their naturally occurring demonstrations.
Model Compression with Multi-Task Knowledge Distillation for Web-scale Question Answering System
- Computer Science · ArXiv
- 2019
A Multi-task Knowledge Distillation Model (MKDM for short) for a web-scale question answering system is proposed; by distilling knowledge from multiple teacher models into a lightweight student model, more generalized knowledge can be transferred.
Distilling the Knowledge in a Neural Network
- Computer Science · ArXiv
- 2015
This work shows that distilling the knowledge in an ensemble of models into a single model can significantly improve the acoustic model of a heavily used commercial system, and it introduces a new type of ensemble composed of one or more full models and many specialist models which learn to distinguish fine-grained classes that the full models confuse.
Are Sixteen Heads Really Better than One?
- Computer Science · NeurIPS
- 2019
Makes the surprising observation that even if models have been trained using multiple heads, in practice a large percentage of attention heads can be removed at test time without significantly impacting performance.
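The head-ablation idea described above can be illustrated by gating each head's output with a 0/1 mask before recombining the heads; the sketch below is a generic illustration with assumed tensor shapes, not the paper's experimental code.

```python
# Gate attention heads with a binary mask and recombine them into the model
# dimension. Shapes and names are illustrative assumptions.
import torch

def mask_heads(attn_output_per_head: torch.Tensor, head_mask: torch.Tensor) -> torch.Tensor:
    """
    attn_output_per_head: (batch, num_heads, seq_len, head_dim)
    head_mask:            (num_heads,) with 1.0 to keep a head, 0.0 to ablate it
    """
    gated = attn_output_per_head * head_mask.view(1, -1, 1, 1)
    # Concatenate the (possibly zeroed) heads back into the model dimension.
    batch, num_heads, seq_len, head_dim = gated.shape
    return gated.permute(0, 2, 1, 3).reshape(batch, seq_len, num_heads * head_dim)
```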
RoBERTa: A Robustly Optimized BERT Pretraining Approach
- Computer Science · ArXiv
- 2019
It is found that BERT was significantly undertrained and can match or exceed the performance of every model published after it; the best model achieves state-of-the-art results on GLUE, RACE and SQuAD.
Attention is All you Need
- Computer Science · NIPS
- 2017
A new simple network architecture, the Transformer, based solely on attention mechanisms and dispensing with recurrence and convolutions entirely, is proposed; it generalizes well to other tasks, as demonstrated by applying it successfully to English constituency parsing with both large and limited training data.
Transformers: State-of-the-Art Natural Language Processing
- Computer Science · EMNLP
- 2020
Transformers is an open-source library that consists of carefully engineered state-of-the-art Transformer architectures under a unified API, together with a curated collection of pretrained models made by and available for the community.
GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding
- Computer Science · BlackboxNLP@EMNLP
- 2018
Presents a benchmark of nine diverse NLU tasks, an auxiliary dataset for probing models for understanding of specific linguistic phenomena, and an online platform for evaluating and comparing models; the benchmark favors models that can represent linguistic knowledge in a way that facilitates sample-efficient learning and effective knowledge-transfer across tasks.