Distilling Knowledge by Mimicking Features

G. Wang, Yifan Ge, Jianxin Wu
IEEE Transactions on Pattern Analysis and Machine Intelligence
Published 3 November 2020
Knowledge distillation (KD) is a popular method to train efficient networks ("student") with the help of high-capacity networks ("teacher"). Traditional methods use the teacher's soft logits as extra supervision to train the student network. In this paper, we argue that it is more advantageous to make the student mimic the teacher's features in the penultimate layer. Not only can the student directly learn more effective information from the teacher's features, but feature mimicking can also be…
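As a minimal sketch of the core idea (not the paper's exact implementation; the optional linear projection and the mean-squared form here are assumptions), feature mimicking can be expressed as an L2 loss between the student's and the teacher's penultimate-layer features:

```python
import numpy as np

def feature_mimicking_loss(student_feat, teacher_feat, proj=None):
    """L2 loss pulling student penultimate features toward the teacher's.

    student_feat: (batch, d_s) student penultimate-layer features
    teacher_feat: (batch, d_t) teacher penultimate-layer features
    proj:         optional (d_s, d_t) linear projection, used when the two
                  feature dimensions differ (an assumption in this sketch).
    """
    if proj is not None:
        student_feat = student_feat @ proj
    diff = student_feat - teacher_feat
    # mean squared L2 distance over the batch
    return float(np.mean(np.sum(diff * diff, axis=1)))

# toy usage: identical features give zero loss
s = np.ones((4, 8))
t = np.ones((4, 8))
assert feature_mimicking_loss(s, t) == 0.0
```

Unlike logit-based KD, this loss needs no temperature and supervises the student with the full feature vector rather than a class distribution.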

IDa-Det: An Information Discrepancy-aware Distillation for 1-bit Detectors

This paper presents an Information Discrepancy-aware strategy (IDa-Det) to distill 1-bit detectors that can effectively eliminate information discrepancies and significantly reduce the performance gap between a 1-bit detector and its real-valued counterpart.

Compressing Models with Few Samples: Mimicking then Replacing

This paper proposes a new framework named Mimicking then Replacing (MiR) for few-sample compression, which first urges the pruned model to output the same features as the teacher's in the penultimate layer, and then replaces the teacher's layers before the penultimate layer with a well-tuned compact counterpart.

Practical Network Acceleration with Tiny Sets

This paper reveals that dropping blocks is a fundamentally superior approach in this scenario and proposes an algorithm named PRACTISE to accelerate networks using only tiny sets of training images, which outperforms previous methods by a significant margin.

Practical Network Acceleration with Tiny Sets: Hypothesis, Theory, and Algorithm

This paper proposes the finetune convexity hypothesis to explain why recent few-shot compression algorithms do not suffer from overfitting problems and claims dropping blocks is a fundamentally superior few-shot compression scheme in terms of more convex optimization and a higher acceleration ratio.

Towards Efficient Post-training Quantization of Pre-trained Language Models

This paper studies post-training quantization (PTQ) of pre-trained language models (PLMs) and proposes module-wise reconstruction error minimization (MREM), an efficient solution to mitigate the accuracy loss that PTQ incurs.

Skeleton-based Action Recognition via Adaptive Cross-Form Learning

Adaptive Cross-Form Learning (ACFL) is presented, which empowers well-designed GCNs to generate complementary representation from single-form skeletons without changing model capacity, and significantly improves various GCN models, achieving a new record for skeleton-based action recognition.

Guided Hybrid Quantization for Object detection in Multimodal Remote Sensing Imagery via One-to-one Self-teaching

This work designs a structure called guided quantization self-distillation (GQSD), an innovative idea for realizing lightweight models through the synergy of quantization and distillation, and proposes a one-to-one self-teaching module to give the student network the ability of self-judgment.

EVC: Towards Real-Time Neural Image Compression with Mask Decay

Both mask decay and residual representation learning greatly improve the RD performance of the scalable encoder, which significantly narrows the performance gap by 50% and 30% for the medium and small models, respectively.

Localization Distillation for Object Detection

This paper presents a novel localization distillation (LD) method which can efficiently transfer localization knowledge from the teacher to the student, and introduces the concept of a valuable localization region that aids in selectively distilling the classification and localization knowledge for a certain region.

Feature Normalized Knowledge Distillation for Image Classification

This work systematically analyzes the distillation mechanism and proposes a simple yet effective feature-normalized knowledge distillation which introduces a sample-specific correction factor to replace the unified temperature T, better reducing the impact of noise.
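A hedged sketch of the idea: instead of dividing every sample's logits by one shared temperature T, each sample gets its own scaling factor. Using the feature's L2 norm as that factor is a simplifying assumption here, not necessarily the paper's exact formulation:

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def feature_normalized_soft_targets(logits, features):
    """Soft targets with a per-sample correction factor.

    Each sample's logits are scaled by that sample's feature L2 norm
    (a stand-in for the paper's sample-specific correction factor),
    instead of one unified temperature T for the whole batch.
    """
    norms = np.linalg.norm(features, axis=1, keepdims=True)  # (batch, 1)
    return softmax(logits / norms)
```

Samples with large feature norms (often confidently predicted) get softer target distributions, while low-norm (noisier) samples are scaled less aggressively.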

Preparing Lessons: Improve Knowledge Distillation with Better Supervision

Prime-Aware Adaptive Distillation

Adaptive sample weighting is introduced into KD, and Prime-Aware Adaptive Distillation (PAD) is proposed, which incorporates uncertainty learning to perceive the prime samples in distillation and then adaptively emphasizes their effect.
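A minimal sketch of uncertainty-based sample weighting, using the standard heteroscedastic log-variance form as an assumed stand-in for PAD's exact objective:

```python
import numpy as np

def pad_weighted_loss(errors, log_var):
    """Uncertainty-weighted distillation loss (heteroscedastic form).

    errors:  (batch,) per-sample distillation errors, e.g. feature
             mimicking distances to the teacher
    log_var: (batch,) predicted log-variance per sample; low variance
             marks a "prime" sample and raises its weight
    """
    weights = np.exp(-log_var)  # 1 / sigma^2 per sample
    # the log-variance term penalizes predicting huge variance everywhere
    return float(np.mean(weights * errors + log_var))
```

With all log-variances at zero this reduces to the plain mean error; learned variances let the student down-weight unreliable (non-prime) samples.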

A Gift from Knowledge Distillation: Fast Optimization, Network Minimization and Transfer Learning

A novel technique for knowledge transfer, where knowledge from a pretrained deep neural network (DNN) is distilled and transferred to another DNN, which shows the student DNN that learns the distilled knowledge is optimized much faster than the original model and outperforms the original DNN.

Contrastive Representation Distillation

The resulting new objective outperforms knowledge distillation and other cutting-edge distillers on a variety of knowledge transfer tasks, including single-model compression, ensemble distillation, and cross-modal transfer.
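As a simplified sketch of a contrastive distillation objective (an in-batch InfoNCE form; the paper's memory-bank formulation and hyperparameters are not reproduced here), each student embedding is trained to match the teacher embedding of the same sample against the other samples in the batch:

```python
import numpy as np

def infonce_distill_loss(student, teacher, tau=0.1):
    """In-batch InfoNCE objective between student and teacher embeddings.

    student, teacher: (batch, d) embeddings; positives share a row index,
    all other rows in the batch act as negatives.
    """
    s = student / np.linalg.norm(student, axis=1, keepdims=True)
    t = teacher / np.linalg.norm(teacher, axis=1, keepdims=True)
    logits = s @ t.T / tau                         # (batch, batch) similarities
    logits = logits - logits.max(axis=1, keepdims=True)  # stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # positives sit on the diagonal (same sample index)
    return float(-np.mean(np.diag(log_probs)))
```

Maximizing diagonal agreement transfers the teacher's representational structure, not just its per-class outputs.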

Few Sample Knowledge Distillation for Efficient Network Compression

This paper proposes a novel solution for knowledge distillation from label-free few samples to realize both data efficiency and training/processing efficiency and can recover student-net's accuracy to the same level as conventional fine-tuning methods in minutes while using only 1% label-free data of the full training data.

Knowledge Distillation by On-the-Fly Native Ensemble

This work presents an On-the-fly Native Ensemble strategy for one-stage online distillation that improves the generalisation performance of a variety of deep neural networks more significantly than alternative methods on four image classification datasets.

Distilling Object Detectors With Fine-Grained Feature Imitation

A fine-grained feature imitation method exploiting the cross-location discrepancy of feature response on near-object anchor locations, which reveals important information about how the teacher model tends to generalize.

Paraphrasing Complex Network: Network Compression via Factor Transfer

A novel knowledge transfer method which uses convolutional operations to paraphrase the teacher's knowledge and to translate it for the student; the student network trained with the proposed factor transfer method outperforms those trained with conventional knowledge transfer methods.

Correlation Congruence for Knowledge Distillation

A new framework named correlation congruence for knowledge distillation (CCKD), which transfers not only the instance-level information but also the correlation between instances; a generalized kernel method based on Taylor series expansion is proposed to better capture the correlation between instances.
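A minimal sketch of correlation congruence: beyond per-instance targets, the student's batch-wise instance-correlation matrix is matched to the teacher's. A plain dot-product kernel is used here as a simplification; the paper's Taylor-expanded Gaussian kernel is omitted:

```python
import numpy as np

def correlation_congruence_loss(student, teacher):
    """Match pairwise instance correlations between student and teacher.

    student, teacher: (batch, d) embeddings. Both are L2-normalized and
    their (batch, batch) correlation (Gram) matrices are compared.
    """
    s = student / np.linalg.norm(student, axis=1, keepdims=True)
    t = teacher / np.linalg.norm(teacher, axis=1, keepdims=True)
    corr_s = s @ s.T  # student instance-correlation matrix
    corr_t = t @ t.T  # teacher instance-correlation matrix
    b = student.shape[0]
    return float(np.sum((corr_s - corr_t) ** 2) / (b * b))
```

Because the loss depends only on within-batch similarities, the student can have a different feature dimension than the teacher and still inherit the teacher's relational structure.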