Distilling Knowledge by Mimicking Features
@article{Wang2020DistillingKB,
  title   = {Distilling Knowledge by Mimicking Features},
  author  = {G. Wang and Yifan Ge and Jianxin Wu},
  journal = {IEEE Transactions on Pattern Analysis and Machine Intelligence},
  year    = {2020},
  volume  = {44},
  pages   = {8183-8195}
}
Knowledge distillation (KD) is a popular method to train efficient networks (“student”) with the help of high-capacity networks (“teacher”). Traditional methods use the teacher’s soft logits as extra supervision to train the student network. In this paper, we argue that it is more advantageous to make the student mimic the teacher’s features in the penultimate layer. Not only can the student directly learn more effective information from the teacher’s features, but feature mimicking can also be…
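To make the core idea concrete, below is a minimal PyTorch sketch of a penultimate-layer feature-mimicking loss. This is an illustration only, not the paper's exact objective: the `FeatureMimickingLoss` class, the linear projection used to bridge a dimension mismatch, and the `lambda_mimic` weight in the usage note are assumptions for the example.

```python
# Minimal sketch of penultimate-layer feature mimicking (assumed names; not the paper's exact loss).
import torch
import torch.nn as nn
import torch.nn.functional as F

class FeatureMimickingLoss(nn.Module):
    def __init__(self, student_dim: int, teacher_dim: int):
        super().__init__()
        # Linear projection so the student's penultimate features can be
        # compared with the teacher's when their dimensions differ.
        self.proj = nn.Linear(student_dim, teacher_dim)

    def forward(self, student_feat: torch.Tensor, teacher_feat: torch.Tensor) -> torch.Tensor:
        # student_feat: (batch, student_dim) penultimate-layer features of the student
        # teacher_feat: (batch, teacher_dim) penultimate-layer features of the frozen teacher
        return F.mse_loss(self.proj(student_feat), teacher_feat.detach())

# Usage sketch: combine feature mimicking with the ordinary task loss.
# mimic = FeatureMimickingLoss(student_dim=512, teacher_dim=2048)
# loss = F.cross_entropy(student_logits, labels) + lambda_mimic * mimic(s_feat, t_feat)
```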
9 Citations
IDa-Det: An Information Discrepancy-aware Distillation for 1-bit Detectors
- Computer Science, ECCV
- 2022
This paper presents an Information Discrepancy-aware strategy (IDa-Det) to distill 1-bit detectors that can effectively eliminate information discrepancies and significantly reduce the performance gap between a 1-bit detector and its real-valued counterpart.
Compressing Models with Few Samples: Mimicking then Replacing
- Computer Science, 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
- 2022
This paper proposes a new framework named Mimicking then Replacing (MiR) for few-sample compression, which first urges the pruned model to output the same features as the teacher's in the penultimate layer, and then replaces the teacher's layers before the penultimate layer with a well-tuned compact one.
Practical Network Acceleration with Tiny Sets
- Computer Science, ArXiv
- 2022
This paper reveals that dropping blocks is a fundamentally superior approach in this scenario and proposes an algorithm named PRACTISE to accelerate networks using only tiny sets of training images, which outperforms previous methods by a significant margin.
Practical Network Acceleration with Tiny Sets: Hypothesis, Theory, and Algorithm
- Computer Science
- 2023
This paper proposes the finetune convexity hypothesis to explain why recent few-shot compression algorithms do not suffer from overfitting problems and claims dropping blocks is a fundamentally superior few-shot compression scheme in terms of more convex optimization and a higher acceleration ratio.
Towards Efficient Post-training Quantization of Pre-trained Language Models
- Computer Science, ArXiv
- 2021
This paper studies post-training quantization (PTQ) of PLMs and proposes module-wise quantization error minimization (MREM), an efficient solution to mitigate these issues.
Skeleton-based Action Recognition via Adaptive Cross-Form Learning
- Computer Science, ACM Multimedia
- 2022
Adaptive Cross-Form Learning (ACFL) is presented, which empowers well-designed GCNs to generate complementary representation from single-form skeletons without changing model capacity, and significantly improves various GCN models, achieving a new record for skeleton-based action recognition.
Guided Hybrid Quantization for Object detection in Multimodal Remote Sensing Imagery via One-to-one Self-teaching
- Computer Science, ArXiv
- 2023
This work designs a structure called guided quantization self-distillation (GQSD), an innovative idea for realizing lightweight models through the synergy of quantization and distillation, and proposes a one-to-one self-teaching module to give the student network the ability of self-judgment.
EVC: Towards Real-Time Neural Image Compression with Mask Decay
- Computer Science, ArXiv
- 2023
Both mask decay and residual representation learning greatly improve the RD performance of the scalable encoder, which significantly narrows the performance gap by 50% and 30% for the medium and small models, respectively.
Localization Distillation for Object Detection
- Computer Science, IEEE Transactions on Pattern Analysis and Machine Intelligence
- 2023
This paper presents a novel localization distillation (LD) method that can efficiently transfer localization knowledge from the teacher to the student, and introduces the concept of a valuable localization region that helps to selectively distill the classification and localization knowledge for a certain region.
References
Showing 1-10 of 40 references
Feature Normalized Knowledge Distillation for Image Classification
- Computer Science, ECCV
- 2020
This work systematically analyzes the distillation mechanism and proposes a simple yet effective feature-normalized knowledge distillation, which introduces a sample-specific correction factor to replace the unified temperature T to better reduce the impact of noise.
Preparing Lessons: Improve Knowledge Distillation with Better Supervision
- Computer Science, Neurocomputing
- 2021
Prime-Aware Adaptive Distillation
- Computer Science, ECCV
- 2020
Adaptive sample weighting is introduced to KD, and Prime-Aware Adaptive Distillation (PAD) is proposed by incorporating uncertainty learning, which perceives the prime samples in distillation and then adaptively emphasizes their effect.
A Gift from Knowledge Distillation: Fast Optimization, Network Minimization and Transfer Learning
- Computer Science, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)
- 2017
A novel technique for knowledge transfer, where knowledge from a pretrained deep neural network (DNN) is distilled and transferred to another DNN; the student DNN that learns the distilled knowledge is optimized much faster than the original model and outperforms the original DNN.
Contrastive Representation Distillation
- Computer Science, ICLR
- 2020
The resulting new objective outperforms knowledge distillation and other cutting-edge distillers on a variety of knowledge transfer tasks, including single model compression, ensemble distillation, and cross-modal transfer.
Few Sample Knowledge Distillation for Efficient Network Compression
- Computer Science, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
- 2020
This paper proposes a novel solution for knowledge distillation from label-free few samples to realize both data efficiency and training/processing efficiency, and can recover the student network's accuracy to the same level as conventional fine-tuning methods in minutes while using only 1% label-free data of the full training set.
Knowledge Distillation by On-the-Fly Native Ensemble
- Computer Science, NeurIPS
- 2018
This work presents an On-the-fly Native Ensemble strategy for one-stage online distillation that improves the generalisation performance of a variety of deep neural networks more significantly than alternative methods on four image classification datasets.
Distilling Object Detectors With Fine-Grained Feature Imitation
- Computer Science, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
- 2019
A fine-grained feature imitation method is proposed, exploiting the cross-location discrepancy of feature response based on the observation that near-object anchor locations reveal important information about how the teacher model tends to generalize.
Paraphrasing Complex Network: Network Compression via Factor Transfer
- Computer Science, NeurIPS
- 2018
A novel knowledge transfer method which uses convolutional operations to paraphrase the teacher's knowledge and to translate it for the student, observing that the student network trained with the proposed factor transfer method outperforms the ones trained with conventional knowledge transfer methods.
Correlation Congruence for Knowledge Distillation
- Computer Science, 2019 IEEE/CVF International Conference on Computer Vision (ICCV)
- 2019
A new framework named correlation congruence for knowledge distillation (CCKD), which transfers not only the instance-level information but also the correlation between instances, and a generalized kernel method based on Taylor series expansion is proposed to better capture the correlation between instances.