Corpus ID: 203626972

DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter

@article{Sanh2019DistilBERTAD,
  title={DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter},
  author={Victor Sanh and Lysandre Debut and Julien Chaumond and Thomas Wolf},
  journal={ArXiv},
  year={2019},
  volume={abs/1910.01108}
}
As Transfer Learning from large-scale pre-trained models becomes more prevalent in Natural Language Processing (NLP), operating these large models on the edge and/or under constrained computational training or inference budgets remains challenging. In this work, we propose a method to pre-train a smaller general-purpose language representation model, called DistilBERT, which can then be fine-tuned with good performance on a wide range of tasks like its larger counterparts. While most prior…
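
For orientation, here is a minimal usage sketch (not part of the paper) that loads the publicly released distilbert-base-uncased checkpoint through the Transformers library listed in the references below; the checkpoint name, the example sentence, and the choice to print raw hidden states are illustrative assumptions.

# Minimal sketch: load the public distilbert-base-uncased checkpoint (an assumption,
# not something prescribed by the paper) and extract contextual token embeddings.
import torch
from transformers import DistilBertTokenizer, DistilBertModel

tokenizer = DistilBertTokenizer.from_pretrained("distilbert-base-uncased")
model = DistilBertModel.from_pretrained("distilbert-base-uncased")
model.eval()

inputs = tokenizer("DistilBERT is smaller, faster, cheaper and lighter.", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# The 6-layer student keeps BERT's hidden size of 768: shape (batch, sequence_length, 768)
print(outputs.last_hidden_state.shape)

Because the student keeps BERT's general architecture (with half the layers), its output embeddings can typically be used as a drop-in replacement in pipelines built around BERT-base.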

Citations

TinyBERT: Distilling BERT for Natural Language Understanding

A novel Transformer distillation method specially designed for knowledge distillation (KD) of Transformer-based models is proposed; by leveraging this new KD method, the rich knowledge encoded in a large “teacher” BERT can be effectively transferred to a small “student” TinyBERT.

RobBERTje: a Distilled Dutch BERT Model

This paper creates several distilled versions of the state-of-the-art Dutch RobBERT model and finds that the larger DistilBERT architecture works significantly better than the Bort hyperparametrization, and that the distilled models exhibit less gender-stereotypical bias than their teacher model.

MoEBERT: from BERT to Mixture-of-Experts via Importance-Guided Adaptation

This work proposes MoEBERT, which uses a Mixture-of-Experts structure to increase model capacity and inference speed, together with a layer-wise distillation method to train it; MoEBERT outperforms existing task-specific distillation algorithms.

MixKD: Towards Efficient Distillation of Large-scale Language Models

MixKD, a data-agnostic distillation framework, is proposed; it leverages mixup, a simple yet efficient data augmentation approach, to endow the resulting model with stronger generalization ability, and it is proved from a theoretical perspective that, under reasonable conditions, MixKD gives rise to a smaller gap between the generalization error and the empirical error.
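
For readers unfamiliar with mixup, a generic sketch of the interpolation it relies on (an illustration only; applying the mixing to continuous input representations and the Beta parameter alpha are assumptions, not MixKD's exact recipe):

# Generic mixup sketch: interpolate two examples and their (soft) labels with
# lambda ~ Beta(alpha, alpha). The embedding shapes below are illustrative.
import torch

def mixup(x_a, x_b, y_a, y_b, alpha=0.4):
    lam = torch.distributions.Beta(alpha, alpha).sample().item()
    return lam * x_a + (1.0 - lam) * x_b, lam * y_a + (1.0 - lam) * y_b

# Toy example: mix two batches of token embeddings and one-hot labels.
x_a, x_b = torch.randn(8, 128, 768), torch.randn(8, 128, 768)
y_a = torch.nn.functional.one_hot(torch.randint(2, (8,)), num_classes=2).float()
y_b = torch.nn.functional.one_hot(torch.randint(2, (8,)), num_classes=2).float()
x_mix, y_mix = mixup(x_a, x_b, y_a, y_b)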

LRC-BERT: Latent-representation Contrastive Knowledge Distillation for Natural Language Understanding

This work proposes LRC-BERT, a knowledge distillation method based on contrastive learning that fits the output of the intermediate layers in terms of angular distance, an aspect not considered by existing distillation methods.

Understanding BERT Rankers Under Distillation

This work studies whether and how the knowledge for search within BERT can be transferred to a smaller ranker through distillation, producing up to a nine-fold speedup while preserving state-of-the-art performance.

Towards Effective Utilization of Pre-trained Language Models

This thesis proposes MKD, a Multi-Task Knowledge Distillation approach in which a large pretrained model serves as the teacher and transfers its knowledge to a small student model; the student is distilled on several tasks jointly, so that the distilled model learns a more universal language representation by leveraging cross-task data.

RefBERT: Compressing BERT by Referencing to Pre-computed Representations

RefBERT is proposed to leverage the knowledge learned from the teacher, i.e., exploiting the pre-computed BERT representations of reference samples while compressing BERT into a smaller student model, which is 7.4x smaller and 9.5x faster at inference than BERT-base.

Poor Man's BERT: Smaller and Faster Transformer Models

A number of memory-light model reduction strategies that do not require pre-training from scratch are explored; they are able to prune BERT, RoBERTa and XLNet models by up to 40% while maintaining up to 98% of their original performance.

Improving Generalization of Pre-trained Language Models via Stochastic Weight Averaging

This work adapts Stochastic Weight Averaging (SWA), a method encouraging convergence to a flatter minimum, to pre-trained language models (PLMs) and demonstrates that this simple optimization technique is able to outperform state-of-the-art KD methods for compact models.
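
A minimal sketch of the weight-averaging update at the heart of SWA (illustrative only; the averaging interval and the toy model below are assumptions, not this paper's training schedule):

# Minimal SWA sketch: keep a running average of the weights visited late in training.
# Buffers such as BatchNorm statistics are ignored in this sketch.
import copy
import torch

def update_swa(swa_model, model, n_averaged):
    # Running mean of weights: w_swa <- w_swa + (w - w_swa) / (n_averaged + 1)
    with torch.no_grad():
        for p_swa, p in zip(swa_model.parameters(), model.parameters()):
            p_swa.add_(p.detach() - p_swa, alpha=1.0 / (n_averaged + 1))
    return n_averaged + 1

# Toy usage: average the weights at the end of each epoch (the interval is an assumption).
model = torch.nn.Linear(768, 2)      # stand-in for a fine-tuned compact model
swa_model = copy.deepcopy(model)
n_averaged = 0
for epoch in range(3):
    # ... training steps updating `model` would go here ...
    n_averaged = update_swa(swa_model, model, n_averaged)
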
...

References

SHOWING 1-10 OF 21 REFERENCES

Distilling Task-Specific Knowledge from BERT into Simple Neural Networks

This paper proposes to distill knowledge from BERT, a state-of-the-art language representation model, into a single-layer BiLSTM, as well as its siamese counterpart for sentence-pair tasks, and achieves results comparable to ELMo.

Well-Read Students Learn Better: The Impact of Student Initialization on Knowledge Distillation

It is observed that applying language model pre-training to students unlocks their generalization potential, surprisingly even for very compact networks.

Language Models are Unsupervised Multitask Learners

It is demonstrated that language models begin to learn tasks such as question answering, machine translation, reading comprehension, and summarization without any explicit supervision when trained on a new dataset of millions of webpages called WebText, suggesting a promising path towards building language processing systems which learn to perform tasks from their naturally occurring demonstrations.

Model Compression with Multi-Task Knowledge Distillation for Web-scale Question Answering System

A Multi-task Knowledge Distillation Model (MKDM for short) for a web-scale Question Answering system is proposed, distilling knowledge from multiple teacher models into a light-weight student model so that more generalized knowledge can be transferred.

Distilling the Knowledge in a Neural Network

This work shows that it can significantly improve the acoustic model of a heavily used commercial system by distilling the knowledge in an ensemble of models into a single model and introduces a new type of ensemble composed of one or more full models and many specialist models which learn to distinguish fine-grained classes that the full models confuse.
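
A minimal sketch of the temperature-softened soft-target loss this reference introduces and the distillation work above builds on (the temperature and mixing weight below are illustrative assumptions, not values prescribed by either paper):

# Temperature-softened distillation loss in the style of this reference.
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    # Soft-target term: KL between temperature-softened teacher and student
    # distributions, scaled by T^2 to keep gradient magnitudes comparable.
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)
    # Hard-target term: ordinary cross-entropy against the ground-truth labels.
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1.0 - alpha) * hard

# Toy example with random logits for a 3-class task.
student, teacher = torch.randn(4, 3), torch.randn(4, 3)
labels = torch.randint(3, (4,))
print(distillation_loss(student, teacher, labels))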

Are Sixteen Heads Really Better than One?

The surprising observation is made that even if models have been trained using multiple heads, in practice a large percentage of attention heads can be removed at test time without significantly impacting performance.

RoBERTa: A Robustly Optimized BERT Pretraining Approach

It is found that BERT was significantly undertrained and, with careful tuning, can match or exceed the performance of every model published after it; the best model achieves state-of-the-art results on GLUE, RACE and SQuAD.

Attention is All you Need

A new simple network architecture, the Transformer, based solely on attention mechanisms and dispensing with recurrence and convolutions entirely, is proposed; it generalizes well to other tasks, as shown by its successful application to English constituency parsing with both large and limited training data.

Transformers: State-of-the-Art Natural Language Processing

Transformers is an open-source library that consists of carefully engineered state-of-the-art Transformer architectures under a unified API and a curated collection of pretrained models made by and available for the community.

GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding

A benchmark of nine diverse NLU tasks, an auxiliary dataset for probing models for understanding of specific linguistic phenomena, and an online platform for evaluating and comparing models are presented; the benchmark favors models that can represent linguistic knowledge in a way that facilitates sample-efficient learning and effective knowledge-transfer across tasks.