Follow Your Path: a Progressive Method for Knowledge Distillation

Wenxian Shi, Yuxuan Song, Hao Zhou, Bohan Li, Lei Li
Published 2021 · Computer Science · arXiv
Deep neural networks often have a huge number of parameters, which poses challenges for deployment in application scenarios with limited memory and computation capacity. Knowledge distillation is one approach to deriving compact models from bigger ones. However, it has been observed that a converged heavy teacher model imposes a strong constraint on learning a compact student network and can trap the optimization in poor local optima. In this paper, we propose ProKT, a new model-agnostic…
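The abstract refers to the standard knowledge-distillation setup, where a student is trained to match a teacher's temperature-softened output distribution alongside the ground-truth labels. As a concrete illustration (a generic sketch of Hinton-style distillation, not the paper's ProKT method; the function names and the `temperature`/`alpha` values are illustrative assumptions):

```python
import numpy as np

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over the last axis."""
    z = logits / temperature
    z = z - z.max(axis=-1, keepdims=True)  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits, teacher_logits, labels,
                      temperature=4.0, alpha=0.5):
    """Weighted sum of the soft-target KL term (teacher -> student)
    and hard-label cross-entropy, as in standard knowledge distillation."""
    p_t = softmax(teacher_logits, temperature)
    p_s = softmax(student_logits, temperature)
    # KL(teacher || student) on the temperature-softened distributions
    kl = np.sum(p_t * (np.log(p_t + 1e-12) - np.log(p_s + 1e-12)), axis=-1)
    # cross-entropy of the student's unsoftened predictions with the labels
    log_p = np.log(softmax(student_logits) + 1e-12)
    ce = -log_p[np.arange(len(labels)), labels]
    # the T^2 factor keeps soft-target gradients comparable in magnitude
    return np.mean(alpha * (temperature ** 2) * kl + (1 - alpha) * ce)
```

A converged teacher fixes `p_t` for the whole of training, which is the "strong constraint" the abstract refers to; progressive methods instead vary the target the student follows.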

Related papers

Knowledge Distillation via Route Constrained Optimization
This work considers knowledge distillation from the perspective of curriculum learning via the teacher's optimization route, and finds that the representation of a converged heavy model remains a strong constraint for training a small student model, leading to a higher lower bound on the congruence loss.
Model compression via distillation and quantization
This paper proposes two new compression methods, which jointly leverage weight quantization and distillation of larger teacher networks into smaller student networks, and shows that quantized shallow students can reach similar accuracy levels to full-precision teacher models.
Deep Mutual Learning
Surprisingly, it is revealed that no powerful pre-trained teacher network is necessary: mutual learning among a collection of simple student networks works, and moreover outperforms distillation from a more powerful yet static teacher.
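In the mutual-learning setup described above, each network is trained on its own cross-entropy loss plus a KL term toward its peer's current predictions, with no fixed teacher. A minimal sketch of the per-batch objectives (the function name is an illustrative assumption, not the paper's code):

```python
import numpy as np

def softmax(logits):
    """Row-wise softmax with max-subtraction for numerical stability."""
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def mutual_learning_losses(logits_a, logits_b, labels):
    """Per-network mutual-learning objective: cross-entropy with the
    ground-truth labels plus KL toward the peer's current predictions."""
    p_a, p_b = softmax(logits_a), softmax(logits_b)
    ce = lambda p: -np.log(p[np.arange(len(labels)), labels] + 1e-12)
    kl = lambda p, q: np.sum(p * (np.log(p + 1e-12) - np.log(q + 1e-12)), axis=-1)
    loss_a = np.mean(ce(p_a) + kl(p_b, p_a))  # network A mimics B
    loss_b = np.mean(ce(p_b) + kl(p_a, p_b))  # network B mimics A
    return loss_a, loss_b
```

Because both targets keep moving as the cohort trains, each network follows a gradually improving peer rather than a fixed converged teacher.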
Improved Knowledge Distillation via Teacher Assistant: Bridging the Gap Between Student and Teacher
Multi-step knowledge distillation is introduced, which employs an intermediate-sized network (a teacher assistant) to bridge the gap between the student and the teacher, alleviating the drop in student performance that occurs when that gap is too large.
Contrastive Representation Distillation
The resulting new objective outperforms knowledge distillation and other cutting-edge distillers on a variety of knowledge transfer tasks, including single model compression, ensemble distillation, and cross-modal transfer.
A Gift from Knowledge Distillation: Fast Optimization, Network Minimization and Transfer Learning
A novel technique for knowledge transfer, where knowledge from a pretrained deep neural network (DNN) is distilled and transferred to another DNN; the student DNN that learns the distilled knowledge is optimized much faster than, and outperforms, the original network.
Patient Knowledge Distillation for BERT Model Compression
This work proposes a Patient Knowledge Distillation approach to compress an original large model (teacher) into an equally effective lightweight shallow network (student), which translates into improved results on multiple NLP tasks with a significant gain in training efficiency, without sacrificing model accuracy.
Generative Bridging Network for Neural Sequence Prediction
Three different GBNs (uniform GBN, language-model GBN, and coaching GBN) are proposed to penalize confidence, enhance language smoothness, and relieve the learning burden, respectively; experiments show that they yield significant improvements over strong baselines.
Well-Read Students Learn Better: On the Importance of Pre-training Compact Models
It is shown that pre-training remains important in the context of smaller architectures, and that fine-tuning pre-trained compact models can be competitive with more elaborate methods proposed in concurrent work.
Distilling the Knowledge in a Neural Network
This work shows that it can significantly improve the acoustic model of a heavily used commercial system by distilling the knowledge in an ensemble of models into a single model, and introduces a new type of ensemble composed of one or more full models and many specialist models which learn to distinguish fine-grained classes that the full models confuse.