Wasserstein Contrastive Representation Distillation

Liqun Chen, Zhe Gan, Dong Wang, Jingjing Liu, Ricardo Henao, Lawrence Carin
Published 15 December 2020 · 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
The primary goal of knowledge distillation (KD) is to encapsulate the knowledge learned by a teacher network into a more compact student network. Existing work, e.g., using the Kullback-Leibler divergence for distillation, may fail to capture important structural knowledge in the teacher network and often lacks the ability for feature generalization, particularly in situations when teacher and student are built to address different…
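The KL-based objective the abstract refers to is typically the soft-target loss of Hinton et al.: the student matches the teacher's temperature-softened class distribution. A minimal pure-Python sketch (function names are illustrative, not from the paper):

```python
import math

def softmax(logits, T=1.0):
    # Temperature-scaled softmax; higher T yields a softer distribution.
    exps = [math.exp(z / T) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kd_kl_loss(teacher_logits, student_logits, T=4.0):
    """KL(teacher || student) over temperature-softened class distributions,
    scaled by T^2 as in the standard soft-target formulation."""
    p = softmax(teacher_logits, T)
    q = softmax(student_logits, T)
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))
    return (T ** 2) * kl
```

Because this loss compares the two output distributions one sample at a time, it carries no information about relations *between* samples — the "structural knowledge" the paper argues a contrastive or Wasserstein objective can capture.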


Information Theoretic Representation Distillation

This work forges an alternative connection between information theory and knowledge distillation using a recently proposed entropy-like functional, introducing two distinct, complementary losses that aim to maximise the correlation and mutual information between the student and teacher representations.

Network Binarization via Contrastive Learning

The experimental results show that the method can be implemented as a plug-in module on top of existing state-of-the-art binarization methods and markedly improves their performance on CIFAR-10/100 and ImageNet, while also generalizing well to NYUD-v2.

Faculty Distillation with Optimal Transport

This work proposes to link the teacher's task and the student's task via optimal transport, based on the semantic relationship between their label spaces, and bridges the support gap between output distributions by minimizing Sinkhorn distances.
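The Sinkhorn distance mentioned here is the entropically regularized optimal transport cost, computed by alternating row/column scalings of a Gibbs kernel. A rough pure-Python sketch of the idea (names and defaults are illustrative, not the paper's code):

```python
import math

def sinkhorn(cost, a, b, eps=0.1, n_iters=200):
    """Entropic-regularized OT: transport cost between histograms a and b
    under the given cost matrix (all plain lists of floats)."""
    n, m = len(a), len(b)
    # Gibbs kernel K = exp(-C / eps).
    K = [[math.exp(-cost[i][j] / eps) for j in range(m)] for i in range(n)]
    u = [1.0] * n
    v = [1.0] * m
    for _ in range(n_iters):
        # Alternate scalings so the plan's marginals match a and b.
        u = [a[i] / sum(K[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [b[j] / sum(K[i][j] * u[i] for i in range(n)) for j in range(m)]
    # Transport plan P = diag(u) K diag(v); return <P, C>.
    return sum(u[i] * K[i][j] * v[j] * cost[i][j]
               for i in range(n) for j in range(m))
```

As eps shrinks, the result approaches the unregularized Wasserstein cost; larger eps trades accuracy for faster, more stable iterations.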

Data Efficient Language-supervised Zero-shot Recognition with Optimal Transport Distillation

OTTER (Optimal TransporT distillation for Efficient zero-shot Recognition) uses online entropic optimal transport to find soft image-text matches as labels for contrastive learning, and achieves strong performance with only 3M image-text pairs.

DeepWSD: Projecting Degradations in Perceptual Space to Wasserstein Distance in Deep Feature Space

The deep Wasserstein distance (DeepWSD), computed on features from neural networks, enjoys better interpretability of the quality contamination caused by various types of distortions and offers advanced quality prediction capability.

Learning from Students: Online Contrastive Distillation Network for General Continual Learning

An Online Contrastive Distillation Network (OCD-Net) is proposed, which exploits the merit of the student model at each time step to guide the training of the student model and consolidate the learned knowledge.

Knowledge Condensation Distillation

The knowledge value on each sample is dynamically estimated, based on which an Expectation-Maximization (EM) framework is forged to iteratively condense a compact knowledge set from the teacher to guide the student learning.

Modality-Aware Contrastive Instance Learning with Self-Distillation for Weakly-Supervised Audio-Visual Violence Detection

A modality-aware contrastive instance learning with self-distillation (MACIL-SD) strategy is proposed, which leverages a lightweight two-stream network to generate audio and visual bags, in which unimodal background, violent, and normal instances are clustered into semi-bags in an unsupervised way.

Contrastive Deep Supervision

A novel training framework named Contrastive Deep Supervision is proposed, which supervises the intermediate layers with augmentation-based contrastive learning and improves general image classification, fine-grained image classification, and object detection across supervised learning, semi-supervised learning, and knowledge distillation.

Contrastive Information Transfer for Pre-Ranking Systems

A new Contrastive Information Transfer (CIT) framework is proposed to transfer useful information from the ranking model to the pre-ranking model; it alleviates selection bias and improves recall metrics, which is crucial for pre-ranking models.

Contrastive Representation Distillation

The resulting new objective outperforms knowledge distillation and other cutting-edge distillers on a variety of knowledge transfer tasks, including single model compression, ensemble distillation, and cross-modal transfer.
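Contrastive distillation objectives of this kind are typically InfoNCE-style: the student representation is pulled toward its teacher counterpart and pushed away from teacher representations of other samples. A minimal sketch, assuming cosine similarity and a temperature parameter (names are illustrative):

```python
import math

def cosine(u, v):
    # Cosine similarity between two plain-list vectors.
    dot = sum(x * y for x, y in zip(u, v))
    nu = math.sqrt(sum(x * x for x in u))
    nv = math.sqrt(sum(x * x for x in v))
    return dot / (nu * nv)

def info_nce(student_emb, teacher_embs, pos_idx, tau=0.1):
    """InfoNCE loss: teacher_embs[pos_idx] is the positive (same sample);
    the remaining teacher embeddings act as negatives."""
    sims = [cosine(student_emb, t) / tau for t in teacher_embs]
    # Numerically stable log-sum-exp over all candidates.
    m = max(sims)
    log_z = m + math.log(sum(math.exp(s - m) for s in sims))
    return log_z - sims[pos_idx]
```

The loss is lowest when the student embedding aligns with its own teacher embedding and is dissimilar from the negatives, which is what makes the objective a lower bound on student-teacher mutual information.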

Similarity-Preserving Knowledge Distillation

This paper proposes a new form of knowledge distillation loss that is inspired by the observation that semantically similar inputs tend to elicit similar activation patterns in a trained network.
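Concretely, similarity-preserving distillation penalizes differences between the row-normalized pairwise similarity matrices of teacher and student batch features. A hedged pure-Python sketch (function names are illustrative, not the paper's code):

```python
import math

def sp_loss(teacher_feats, student_feats):
    """Similarity-preserving KD loss: mean squared difference between the
    row-normalized Gram matrices of teacher and student batch features
    (each given as a list of feature vectors)."""
    def sim_matrix(feats):
        # Gram matrix of pairwise dot products, then L2-normalize each row.
        G = [[sum(a * b for a, b in zip(u, v)) for v in feats] for u in feats]
        out = []
        for row in G:
            norm = math.sqrt(sum(x * x for x in row)) or 1.0
            out.append([x / norm for x in row])
        return out

    Gt = sim_matrix(teacher_feats)
    Gs = sim_matrix(student_feats)
    b = len(teacher_feats)
    return sum((Gt[i][j] - Gs[i][j]) ** 2
               for i in range(b) for j in range(b)) / (b * b)
```

Matching similarity patterns rather than raw activations lets teacher and student use feature spaces of different dimensionality.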

Revisiting Knowledge Distillation via Label Smoothing Regularization

It is argued that the success of KD is due not only to the similarity information between categories provided by teachers, but also to the regularization effect of soft targets, which is equally or even more important.

Contrastive Distillation on Intermediate Representations for Language Model Compression

CoDIR, a principled knowledge distillation framework in which the student is trained to distill knowledge from the intermediate layers of the teacher via a contrastive objective, is proposed; it achieves superb performance on the GLUE benchmark, outperforming state-of-the-art compression methods.

Patient Knowledge Distillation for BERT Model Compression

This work proposes a Patient Knowledge Distillation approach to compress an original large model (teacher) into an equally effective, lightweight shallow network (student), yielding improved results on multiple NLP tasks and a significant gain in training efficiency without sacrificing model accuracy.

Wasserstein Dependency Measure for Representation Learning

It is empirically demonstrated that mutual-information-based representation learning approaches fail to learn complete representations on a number of designed and real-world tasks; a practical approximation to the theoretically motivated solution, constructed using Lipschitz-constraint techniques from the GAN literature, achieves substantially improved results on tasks where incomplete representations are a major challenge.

Knowledge Distillation by On-the-Fly Native Ensemble

This work presents an On-the-fly Native Ensemble strategy for one-stage online distillation that improves the generalisation performance of a variety of deep neural networks more significantly than alternative methods on four image classification datasets.

On Mutual Information Maximization for Representation Learning

This paper argues, and provides empirical evidence, that the success of these methods cannot be attributed to the properties of MI alone, and that they strongly depend on the inductive bias in both the choice of feature extractor architectures and the parametrization of the employed MI estimators.

FitNets: Hints for Thin Deep Nets

This paper extends the idea of a student network that could imitate the soft output of a larger teacher network or ensemble of networks, using not only the outputs but also the intermediate representations learned by the teacher as hints to improve the training process and final performance of the student.

A Theoretical Analysis of Contrastive Unsupervised Representation Learning

This framework allows us to show provable guarantees on the performance of the learned representations on the average classification task comprising a subset of the same set of latent classes, and shows that learned representations can reduce (labeled) sample complexity on downstream tasks.