Adaptive multi-teacher multi-level knowledge distillation

  title={Adaptive multi-teacher multi-level knowledge distillation},
  author={Yuang Liu and W. Zhang and Jun Wang},


Confidence-Aware Multi-Teacher Knowledge Distillation

Confidence-Aware Multi-teacher Knowledge Distillation (CA-MKD) is proposed, which adaptively assigns sample-wise reliability for each teacher prediction with the help of ground-truth labels, with those teacher predictions close to one-hot labels assigned large weights.
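The weighting idea in the summary above can be sketched as follows. This is only an illustration under stated assumptions: the weights here are a softmax over negative cross-entropy against the ground-truth label, which is one plausible choice and not necessarily the formula used in CA-MKD, and the function names (`teacher_weights`, `weighted_teacher_target`) are hypothetical.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax with an optional temperature."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def teacher_weights(teacher_logits, label):
    """Per-sample teacher weights; lower cross-entropy gives a larger weight.

    Assumption: softmax over negative cross-entropy, chosen for illustration.
    """
    cross_entropies = [-math.log(softmax(logits)[label]) for logits in teacher_logits]
    return softmax([-ce for ce in cross_entropies])


def weighted_teacher_target(teacher_logits, label, temperature=4.0):
    """Blend the teachers' softened predictions using the per-sample weights."""
    weights = teacher_weights(teacher_logits, label)
    probs = [softmax(logits, temperature) for logits in teacher_logits]
    num_classes = len(probs[0])
    return [
        sum(w * p[c] for w, p in zip(weights, probs))
        for c in range(num_classes)
    ]
```

For a sample with label 0, a teacher whose prediction is close to the one-hot label dominates the blended target, while a teacher that confidently predicts another class contributes very little.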

One Teacher is Enough? Pre-trained Language Model Distillation from Multiple Teachers

This paper designs a multi-teacher co-finetuning method that jointly finetunes multiple teacher PLMs on downstream tasks with shared pooling and prediction layers, aligning their output spaces for better collaborative teaching, and proposes a multi-teacher knowledge distillation framework named MTBERT for pre-trained language model compression, which can train a high-quality student model from multiple teacher PLMs.

ALM-KD: Knowledge Distillation with noisy labels via adaptive loss mixing

This work learns an instance-specific convex combination of the teacher-matching and label supervision objectives, using meta learning on a validation metric signalling to the student ‘how much’ of KD is to be used.
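As a rough illustration of an instance-specific convex combination, the sketch below mixes a per-sample teacher-matching loss with a per-sample label loss. The meta-learning step that produces the mixing weights is not shown; `mix_weights` is simply taken as given, and placing the weight on the teacher-matching term is an assumption made for this example.

```python
def mixed_loss(kd_losses, label_losses, mix_weights):
    """Mean over samples of lam * kd + (1 - lam) * label_loss.

    Each lam must lie in [0, 1] so that the combination stays convex.
    """
    if not (len(kd_losses) == len(label_losses) == len(mix_weights)):
        raise ValueError("all inputs must have one entry per sample")
    total = 0.0
    for kd, label_loss, lam in zip(kd_losses, label_losses, mix_weights):
        if not 0.0 <= lam <= 1.0:
            raise ValueError("mixing weights must lie in [0, 1]")
        total += lam * kd + (1.0 - lam) * label_loss
    return total / len(kd_losses)
```

A weight of 0 for a sample ignores the teacher entirely for that sample, and a weight of 1 ignores the label, which is how such a scheme could down-weight label supervision for instances suspected of being mislabeled.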

Teacher-Class Network: A Neural Network Compression Mechanism

The proposed teacher-class network consisting of a single teacher and multiple student networks outperforms the state-of-the-art single student approach in terms of accuracy as well as computational cost and in many cases it achieves an accuracy equivalent to the teacher network while having 10-30 times fewer parameters.

Multi-Knowledge Aggregation and Transfer for Semantic Segmentation

A novel multi-knowledge aggregation and transfer (MKAT) framework to comprehensively distill knowledge within an intermediate layer for semantic segmentation is proposed, showing that MKAT outperforms the other KD methods.

PURSUhInT: In Search of Informative Hint Points Based on Layer Clustering for Knowledge Distillation

The results show that hint points selected by the proposed algorithm result in superior compression performance compared with state-of-the-art knowledge distillation algorithms on the same student models and datasets.

Visualizing the embedding space to explain the effect of knowledge distillation

Two non-linear, low-dimensional embedding methods (t-SNE and IVIS) are utilized to visualize representation spaces of different layers in a network and clearly show that distillation guides the network to find a more compact representation space for higher accuracy already in earlier layers compared to its non-distilled version.

Data-Free Knowledge Transfer: A Survey

A comprehensive survey on data-free knowledge transfer from the perspectives of knowledge distillation and unsupervised domain adaptation is provided to help readers have a better understanding of the current research status and ideas.

Reweighing auxiliary losses in supervised learning

This work introduces Amal, which learns instance-specific weights using meta learning on a validation metric to achieve optimal mixing of losses, and empirically analyzes the method, sharing insights into the mechanisms through which it provides performance gains.



Learning from Multiple Teacher Networks

This paper presents a method to train a thin deep network by incorporating multiple teacher networks not only in output layer by averaging the softened outputs from different networks, but also in the intermediate layers by imposing a constraint about the dissimilarity among examples.
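The output-layer part of the summary above, averaging the teachers' softened outputs, can be sketched as below. The intermediate-layer dissimilarity constraint is not covered, and the temperature value and function names are assumptions chosen for illustration.

```python
import math


def softened(logits, temperature):
    """Temperature-softened softmax, computed stably."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def average_soft_targets(teacher_logits, temperature=4.0):
    """Uniform average of each teacher's softened class distribution."""
    probs = [softened(logits, temperature) for logits in teacher_logits]
    num_teachers = len(probs)
    num_classes = len(probs[0])
    return [sum(p[c] for p in probs) / num_teachers for c in range(num_classes)]


def soft_cross_entropy(student_logits, target, temperature=4.0):
    """Cross-entropy between the averaged target and the softened student output."""
    student_probs = softened(student_logits, temperature)
    return -sum(t * math.log(q) for t, q in zip(target, student_probs))
```

With two teachers that disagree symmetrically, the averaged target is uniform over the two classes, so the student is pulled toward expressing that uncertainty.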

Relational Knowledge Distillation

RKD proposes distance-wise and angle-wise distillation losses that penalize structural differences in relations, allowing students to outperform their teachers and achieving state-of-the-art results on standard benchmark datasets.
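A distance-wise relational loss of this kind can be sketched as follows: pairwise distances within a batch are normalized by their mean, and the student's normalized distances are compared with the teacher's using a Huber loss. This is a minimal sketch that assumes at least two distinct embeddings per batch; the angle-wise term is omitted and the function names are hypothetical.

```python
import math
from itertools import combinations


def normalized_distances(embeddings):
    """Pairwise Euclidean distances divided by their mean over the batch."""
    distances = [math.dist(a, b) for a, b in combinations(embeddings, 2)]
    mean = sum(distances) / len(distances)  # assumes the mean is non-zero
    return [d / mean for d in distances]


def huber(x, y, delta=1.0):
    """Huber loss: quadratic for small differences, linear for large ones."""
    diff = abs(x - y)
    if diff <= delta:
        return 0.5 * diff * diff
    return delta * (diff - 0.5 * delta)


def distance_relation_loss(teacher_embeddings, student_embeddings):
    """Mean Huber loss between teacher and student normalized pairwise distances."""
    teacher = normalized_distances(teacher_embeddings)
    student = normalized_distances(student_embeddings)
    return sum(huber(t, s) for t, s in zip(teacher, student)) / len(teacher)
```

Because distances are normalized by the batch mean, a student whose embedding space is a uniformly scaled copy of the teacher's incurs zero loss; only differences in relative structure are penalized, and the student and teacher embeddings may have different dimensionality.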

Deep Mutual Learning

Surprisingly, it is revealed that no prior powerful teacher network is necessary - mutual learning of a collection of simple student networks works, and moreover outperforms distillation from a more powerful yet static teacher.

Learning to Specialize with Knowledge Distillation for Visual Question Answering

This work presents a principled algorithm to learn specialized models with knowledge distillation under a multiple choice learning (MCL) framework, where training examples are assigned dynamically to a subset of models for updating network parameters.

Efficient Knowledge Distillation from an Ensemble of Teachers

It is shown that with knowledge distillation, information from multiple acoustic models like very deep VGG networks and Long Short-Term Memory models can be used to train standard convolutional neural network (CNN) acoustic models for a variety of systems requiring a quick turnaround.

A Gift from Knowledge Distillation: Fast Optimization, Network Minimization and Transfer Learning

A novel technique for knowledge transfer is proposed, in which knowledge from a pretrained deep neural network (DNN) is distilled and transferred to another DNN; the student DNN that learns the distilled knowledge is optimized much faster than the original model and outperforms the original DNN.

FitNets: Hints for Thin Deep Nets

This paper extends the idea of a student network that could imitate the soft output of a larger teacher network or ensemble of networks, using not only the outputs but also the intermediate representations learned by the teacher as hints to improve the training process and final performance of the student.

Learning With Single-Teacher Multi-Student

A new learning problem defined as "Single-Teacher Multi-Student" (STMS) problem, which investigates how to learn a series of student models from a single teacher (complex and universal) model, is studied and a gated support vector machine (gSVM) is proposed as a solution.

Knowledge Adaptation: Teaching to Adapt

This work shows how a student model achieves state-of-the-art results on unsupervised domain adaptation from multiple sources on a standard sentiment analysis benchmark by taking into account the domain-specific expertise of multiple teachers and the similarities between their domains.

Knowledge Distillation for Bilingual Dictionary Induction

A bridging approach to bilingual dictionary induction, where the main contribution is a knowledge distillation training objective, which allows seamless addition of teacher translation paths for any given low resource pair.