Corpus ID: 201666324

Well-Read Students Learn Better: The Impact of Student Initialization on Knowledge Distillation

@article{Turc2019WellReadSL,
  title={Well-Read Students Learn Better: The Impact of Student Initialization on Knowledge Distillation},
  author={Iulia Turc and Ming-Wei Chang and Kenton Lee and Kristina Toutanova},
  journal={ArXiv},
  year={2019},
  volume={abs/1908.08962}
}
Recent developments in NLP have been accompanied by large, expensive models. [...] Key Result: Extensive ablation studies dissect the interaction between pre-training and distillation, revealing a compound effect even when they are applied on the same unlabeled dataset.
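
For readers unfamiliar with the setup the abstract summarizes, the sketch below shows a generic temperature-scaled soft-label distillation loss in PyTorch, where the compact student would already have been pre-trained before distillation. The tensor shapes, temperature value, and variable names are illustrative assumptions, not the paper's exact configuration.

import torch
import torch.nn.functional as F

def distillation_loss(student_logits: torch.Tensor,
                      teacher_logits: torch.Tensor,
                      temperature: float = 2.0) -> torch.Tensor:
    """Temperature-scaled soft-label distillation via KL divergence.

    Both logit tensors have shape (batch, num_classes). The teacher's
    softened distribution is the target; gradients flow only through
    the student. Temperature and shapes here are illustrative.
    """
    log_p_student = F.log_softmax(student_logits / temperature, dim=-1)
    p_teacher = F.softmax(teacher_logits.detach() / temperature, dim=-1)
    # Scale by T^2 so gradient magnitudes stay comparable across temperatures.
    return F.kl_div(log_p_student, p_teacher, reduction="batchmean") * temperature ** 2

# Toy usage: in the pre-trained-distillation setting, the student producing
# these logits would be a compact model already pre-trained (e.g., with a
# masked-LM objective) rather than randomly initialized.
student_logits = torch.randn(8, 2, requires_grad=True)  # stand-in for student(x)
teacher_logits = torch.randn(8, 2)                      # stand-in for teacher(x)
loss = distillation_loss(student_logits, teacher_logits)
loss.backward()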
54 Citations

LRC-BERT: Latent-representation Contrastive Knowledge Distillation for Natural Language Understanding
DoT: An efficient Double Transformer for NLP tasks with tables
Improving Task-Agnostic BERT Distillation with Layer Mapping Search
DiPair: Fast and Accurate Distillation for Trillion-Scale Text Matching and Pair Modeling
...