Differentiable Feature Aggregation Search for Knowledge Distillation

Yushuo Guan, Pengyu Zhao, Bingxuan Wang, Yuanxing Zhang, Cong Yao, Kaigui Bian, Jian Tang
Knowledge distillation has become increasingly important in model compression. It boosts the performance of a miniaturized student network with the supervision of the output distribution and feature maps from a sophisticated teacher network. Some recent works introduce multi-teacher distillation to provide more supervision to the student network. However, the effectiveness of multi-teacher distillation methods comes at the cost of expensive computation resources. To tackle both the efficiency…

Knowledge distillation via softmax regression representation learning

This paper addresses the problem of model compression via knowledge distillation with two approaches: a direct feature matching approach that focuses on optimizing the student's penultimate layer only, and a second approach that decouples representation learning and classification and uses the teacher's pre-trained classifier to train the student's penultimate layer feature.

Channel-wise Knowledge Distillation for Dense Prediction

This work proposes to normalize the activation map of each channel to obtain a soft probability map and demonstrates that the proposed method outperforms state-of-the-art distillation methods considerably, and can require less computational cost during training.

Collaborative Teacher-Student Learning via Multiple Knowledge Transfer

This paper proposes collaborative teacher-student learning via multiple knowledge transfer (CTSL-MKT), which prompts both self-learning and collaborative learning and significantly outperforms state-of-the-art KD methods.

Attention-based Knowledge Distillation in Multi-attention Tasks: The Impact of a DCT-driven Loss

By using global image cues rather than pixel estimates, this strategy enhances knowledge transferability in tasks such as scene recognition, leading to better descriptive features and higher transferred performance than every other state-of-the-art alternative.

Cross-Layer Distillation with Semantic Calibration

Semantic Calibration for Cross-layer Knowledge Distillation (SemCKD), which automatically assigns proper target layers of the teacher model for each student layer with an attention mechanism, demonstrating the effectiveness and flexibility of the proposed attention based soft layer association mechanism for cross-layer distillation.

Channel-wise Distillation for Semantic Segmentation

This paper proposes to align features channel-wise between the student and teacher networks' feature maps in the spatial domain by first transforming the feature map of each channel into a distribution using softmax normalization, and minimizing the Kullback-Leibler divergence of the corresponding channels of the two networks.
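
The summary above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not the authors' implementation: the function names, the temperature parameter, the KL direction (teacher relative to student), and the averaging over channels are all choices made here for the example. Each channel is represented as a flat list of its spatial activations.

```python
import math


def spatial_log_softmax(channel, temperature=1.0):
    """Log-probabilities of one channel's flattened spatial activations."""
    scaled = [v / temperature for v in channel]
    peak = max(scaled)  # subtract the max before exponentiating for stability
    log_norm = peak + math.log(sum(math.exp(v - peak) for v in scaled))
    return [v - log_norm for v in scaled]


def channel_wise_kl(teacher, student, temperature=1.0):
    """Mean over channels of KL(teacher || student).

    teacher, student: lists of channels, each a flat list of H*W activations.
    The KL direction and the plain mean over channels are assumptions.
    """
    if len(teacher) != len(student):
        raise ValueError("teacher and student need the same number of channels")
    total = 0.0
    for t_channel, s_channel in zip(teacher, student):
        log_p = spatial_log_softmax(t_channel, temperature)
        log_q = spatial_log_softmax(s_channel, temperature)
        total += sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))
    return total / len(teacher)


# Two channels of a hypothetical 2x2 feature map, flattened to 4 values each.
teacher_map = [[1.0, 2.0, 3.0, 4.0], [0.5, 0.5, 0.5, 0.5]]
student_map = [[4.0, 3.0, 2.0, 1.0], [0.5, 0.5, 0.5, 0.5]]
loss = channel_wise_kl(teacher_map, student_map)
```

Because the softmax is taken over spatial positions within a channel, adding a constant offset to a whole channel does not change the loss; only the relative spatial pattern matters in this sketch.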

Impact of a DCT-driven Loss in Attention-based Knowledge-Distillation for Scene Recognition

Experimental results provide strong evidence that the proposed strategy enables the student network to better focus on the relevant image areas learnt by the teacher network, hence leading to better descriptive features and higher transferred performance than every other state-of-the-art alternative.

Supplementary Materials: Channel-wise Knowledge Distillation for Dense Prediction

To further demonstrate the effectiveness of the proposed channel distribution distillation (CD), the authors apply CD to the feature maps and report final results on Pascal VOC and ADE20K, showing that CD works better than other structural knowledge distillation methods.

Investigating Bi-Level Optimization for Learning and Vision from a Unified Perspective: A Survey and Beyond

A best-response-based single-level reformulation is constructed, and a unified algorithmic framework for understanding and formulating mainstream gradient-based BLO methodologies is established, covering aspects ranging from fundamental automatic differentiation schemes to various accelerations, simplifications, extensions, and their convergence and complexity properties.

Multi-Person Pose Estimation on Embedded Device

This paper performs model compression and acceleration in multi-person pose estimation by replacing the feature extraction network and applying parameter pruning and knowledge distillation, yielding a compressed model that shows a 25% drop in accuracy compared with a lightweight model.



Towards Oracle Knowledge Distillation with Neural Architecture Search

It is shown that searching for a new student model is effective in both accuracy and memory size and that the searched models often outperform their teacher models thanks to neural architecture search with oracle knowledge distillation.

Similarity-Preserving Knowledge Distillation

This paper proposes a new form of knowledge distillation loss that is inspired by the observation that semantically similar inputs tend to elicit similar activation patterns in a trained network.
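
One way to make the idea concrete is to compare pairwise similarity matrices computed over a batch. The sketch below is a hedged illustration, not the paper's code: the function names are hypothetical, and the specific choices (dot-product similarities, row-wise L2 normalization, squared difference averaged over the b × b entries) are assumptions about how such a loss is commonly written.

```python
import math


def similarity_matrix(activations):
    """Row-normalized b x b matrix of dot products between flattened activations."""
    gram = [[sum(x * y for x, y in zip(a, b)) for b in activations]
            for a in activations]
    normalized = []
    for row in gram:
        norm = math.sqrt(sum(v * v for v in row))
        normalized.append([v / norm if norm > 0 else 0.0 for v in row])
    return normalized


def similarity_preserving_loss(teacher_acts, student_acts):
    """Mean squared difference between teacher and student similarity matrices.

    Each argument is a list of b flattened activation vectors, one per input.
    Teacher and student vectors may have different lengths; only the batch
    size b must match, because the comparison happens in the b x b space.
    """
    b = len(teacher_acts)
    if b != len(student_acts):
        raise ValueError("teacher and student batches must have the same size")
    g_t = similarity_matrix(teacher_acts)
    g_s = similarity_matrix(student_acts)
    squared = sum((g_t[i][j] - g_s[i][j]) ** 2
                  for i in range(b) for j in range(b))
    return squared / (b * b)


# Hypothetical batch of two inputs; the teacher has 3 features, the student 2.
teacher_batch = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
matching_student = [[2.0, 0.0], [0.0, 3.0]]   # same similarity pattern
collapsed_student = [[1.0, 0.0], [1.0, 0.0]]  # treats both inputs as identical
```

In this sketch the student is never asked to reproduce the teacher's features directly, so the two networks can have different widths; a student that preserves which inputs look alike gets zero loss, while one that collapses distinct inputs is penalized.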

Block-Wisely Supervised Neural Architecture Search With Knowledge Distillation

This work proposes to modularize the large search space of NAS into blocks to ensure that the potential candidate architectures are fully trained, and distill the neural architecture (DNA) knowledge from a teacher model to supervise the block-wise architecture search, which significantly improves the effectiveness of NAS.

A Comprehensive Overhaul of Feature Distillation

A novel feature distillation method in which the distillation loss is designed to make a synergy among various aspects: teacher transform, student transform, distillation feature position and distance function, which achieves a significant performance improvement in all tasks.

Learning from Multiple Teacher Networks

This paper presents a method to train a thin deep network by incorporating multiple teacher networks not only in the output layer, by averaging the softened outputs from different networks, but also in the intermediate layers, by imposing a constraint on the dissimilarity among examples.

FitNets: Hints for Thin Deep Nets

This paper extends the idea of a student network that could imitate the soft output of a larger teacher network or ensemble of networks, using not only the outputs but also the intermediate representations learned by the teacher as hints to improve the training process and final performance of the student.
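
The hint idea can be illustrated with a small sketch. Everything here is an assumption made for illustration: the function names are hypothetical, and the regressor is simplified to a plain linear map (a list of weight rows) that projects the narrower student representation to the teacher's width before a squared-error comparison.

```python
def apply_regressor(weights, features):
    """Linear map from student width to teacher width (one output per row)."""
    return [sum(w * f for w, f in zip(row, features)) for row in weights]


def hint_loss(teacher_hint, student_guided, weights):
    """Half the squared L2 distance between the teacher's hint and the
    projected student representation."""
    projected = apply_regressor(weights, student_guided)
    if len(projected) != len(teacher_hint):
        raise ValueError("regressor output must match the teacher hint width")
    return 0.5 * sum((t - p) ** 2 for t, p in zip(teacher_hint, projected))


# Hypothetical sizes: the student layer has 2 units, the teacher hint has 3.
regressor = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
student_features = [1.0, 2.0]          # projects to [1.0, 2.0, 3.0]
matched_hint = [1.0, 2.0, 3.0]
mismatched_hint = [1.0, 2.0, 5.0]
```

The regressor exists only because the student is thinner than the teacher; in a training setup it would be learned jointly with the student layers and discarded afterwards.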

Efficient Neural Architecture Search via Parameter Sharing

Efficient Neural Architecture Search is a fast and inexpensive approach for automatic model design that establishes a new state-of-the-art among all methods without post-training processing and delivers strong empirical performance using far fewer GPU-hours.

Neural Architecture Search with Reinforcement Learning

This paper uses a recurrent network to generate the model descriptions of neural networks and trains this RNN with reinforcement learning to maximize the expected accuracy of the generated architectures on a validation set.

SNAS: Stochastic Neural Architecture Search

It is proved that this search gradient optimizes the same objective as reinforcement-learning-based NAS, but assigns credits to structural decisions more efficiently, and is further augmented with locally decomposable reward to enforce a resource-efficient constraint.

Knowledge Transfer via Distillation of Activation Boundaries Formed by Hidden Neurons

This paper proposes a knowledge transfer method via distillation of activation boundaries formed by hidden neurons and introduces an activation transfer loss that reaches its minimum value when the boundaries generated by the student coincide with those generated by the teacher.