Hierarchical Self-supervised Augmented Knowledge Distillation

Chuanguang Yang, Zhulin An, Linhang Cai, Yongjun Xu
Knowledge distillation often involves how to define and transfer knowledge from teacher to student effectively. Although recent self-supervised contrastive knowledge achieves the best performance, forcing the network to learn such knowledge may damage the representation learning of the original class recognition task. We therefore adopt an alternative self-supervised augmented task to guide the network to learn the joint distribution of the original recognition task and self-supervised… 
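As a rough illustration of what a joint distribution over the recognition task and a self-supervised task can look like, the sketch below builds a joint label space from class labels and transformation indices. The abstract is truncated here and does not name the transformation, so the use of four image rotations, the helper names `rotate_batch` and `joint_labels`, and the encoding `class * M + transform` are all assumptions made for illustration, not the authors' confirmed implementation.

```python
import numpy as np

def rotate_batch(images):
    """Stack the batch rotated by 0, 90, 180 and 270 degrees along axis 0.

    Assumes square images of shape (B, H, W, C) with H == W.
    Rotation is an assumed self-supervised transformation.
    """
    rotated = [np.rot90(images, k=k, axes=(1, 2)) for k in range(4)]
    return np.concatenate(rotated, axis=0)

def joint_labels(class_labels, num_transforms=4):
    """Encode each (class y, transform j) pair as one joint label y * M + j.

    The ordering matches rotate_batch: all samples for transform 0 first,
    then all samples for transform 1, and so on.
    """
    class_labels = np.asarray(class_labels)
    transform_ids = np.repeat(np.arange(num_transforms), len(class_labels))
    tiled_classes = np.tile(class_labels, num_transforms)
    return tiled_classes * num_transforms + transform_ids

# Usage: a batch of two 4x4 RGB images with class labels 1 and 2.
images = np.zeros((2, 4, 4, 3))
augmented = rotate_batch(images)   # four rotated copies of the batch
targets = joint_labels([1, 2])     # one joint label per augmented image
```

Under this encoding, a classifier with N * M outputs is trained on `targets`; the original class can be recovered as `targets // 4` and the transformation index as `targets % 4`.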

Knowledge Distillation Using Hierarchical Self-Supervision Augmented Distribution

Experiments on standard image classification show that both offline and online HSSAKD achieve state-of-the-art performance in the field of KD, and transfer experiments on object detection further verify that HSSAKD can guide the network to learn better features.

MixSKD: Self-Knowledge Distillation from Mixup for Image Recognition

This paper proposes to perform Self-KD from image Mixture (MixSKD), which integrates Self-KD and Mixup into a unified framework and constructs a self-teacher network by aggregating multi-stage feature maps to provide soft labels that supervise the backbone classifier, further improving the efficacy of self-boosting.

InDistill: Transferring Knowledge From Pruned Intermediate Layers

This paper proposes a novel method, termed InDistill, that can drastically improve the performance of existing single-layer knowledge distillation methods by leveraging the properties of channel pruning to both reduce the capacity gap between the models and retain the architectural alignment.

Information Theoretic Representation Distillation

This work forges an alternative connection between information theory and knowledge distillation using a recently proposed entropy-like functional and introduces two distinct complementary losses which aim to maximise the correlation and mutual information between the student and teacher representations.

Cross-Image Relational Knowledge Distillation for Semantic Segmentation

A novel Cross-Image Relational KD that makes the student mimic better structured semantic relations from the teacher, thus improving the segmentation performance, and demonstrates the effectiveness of the proposed approach against state-of-the-art distillation methods.

Mutual Contrastive Learning for Visual Representation Learning

Experimental results on image classification and transfer learning to object detection show that MCL can lead to consistent performance gains, demonstrating that MCL can guide the network to generate better feature representations.

Proto2Proto: Can you recognize the car, the way I do?

This paper presents Proto2Proto, a novel method to transfer the interpretability of one prototypical part network to another via knowledge distillation, and proposes three novel metrics to evaluate the student’s proximity to the teacher as measures of interpretability transfer in settings.

HEAD: HEtero-Assists Distillation for Heterogeneous Object Detectors

The HEtero-Assists Distillation (HEAD) framework is proposed, leveraging heterogeneous detection heads as assistants to guide the optimization of the student detector to reduce the significant semantic gap between the backbone features of heterogeneous detectors.

Localizing Semantic Patches for Accelerating Image Classification

This paper proposes an efficient image classification pipeline that first pinpoints task-aware regions over the input image with a lightweight patch proposal network called AnchorNet, and then feeds these localized semantic patches, which have much smaller spatial redundancy, into a general classification network.



Knowledge Distillation Meets Self-Supervision

A more general and model-agnostic approach for extracting "richer dark knowledge" from the pre-trained teacher model, and it is shown that the seemingly different self-supervision task can serve as a simple yet powerful solution for distillation.

Contrastive Representation Distillation

The resulting new objective outperforms knowledge distillation and other cutting-edge distillers on a variety of knowledge transfer tasks, including single model compression, ensemble distillation, and cross-modal transfer.

Variational Information Distillation for Knowledge Transfer

An information-theoretic framework for knowledge transfer is proposed which formulates knowledge transfer as maximizing the mutual information between the teacher and the student networks and which consistently outperforms existing methods.

Self-supervised Label Augmentation via Input Transformations

This paper proposes a novel knowledge transfer technique, which it refers to as self-distillation, that has the effect of the aggregated inference in a single (faster) inference and demonstrates the large accuracy improvement and wide applicability of the framework on various fully-supervised settings.

Similarity-Preserving Knowledge Distillation

This paper proposes a new form of knowledge distillation loss that is inspired by the observation that semantically similar inputs tend to elicit similar activation patterns in a trained network.

Relational Knowledge Distillation

RKD allows students to outperform their teachers, achieving state-of-the-art results on standard benchmark datasets, and proposes distance-wise and angle-wise distillation losses that penalize structural differences in relations.

A Gift from Knowledge Distillation: Fast Optimization, Network Minimization and Transfer Learning

A novel technique for knowledge transfer, where knowledge from a pretrained deep neural network (DNN) is distilled and transferred to another DNN; the student DNN that learns the distilled knowledge is shown to be optimized much faster than the original model and to outperform the original DNN.

Multi-View Contrastive Learning for Online Knowledge Distillation

This work proposes Multi-view Contrastive Learning (MCL) for OKD to implicitly capture correlations of feature embeddings encoded by multiple peer networks, which provide various views for understanding the input data instances and can learn a more discriminative representation space for classification.

Heterogeneous Knowledge Distillation Using Information Flow Modeling

This paper proposes a novel KD method that works by modeling the information flow through the various layers of the teacher model and then training a student model to mimic this information flow.

A Simple Framework for Contrastive Learning of Visual Representations

It is shown that composition of data augmentations plays a critical role in defining effective predictive tasks, and introducing a learnable nonlinear transformation between the representation and the contrastive loss substantially improves the quality of the learned representations, and contrastive learning benefits from larger batch sizes and more training steps compared to supervised learning.