Robust Active Distillation

Cenk Baykal, Khoa Trinh, Fotis Iliopoulos, Gaurav Menghani, Erik Vee
Distilling knowledge from a large teacher model to a lightweight one is a widely successful approach for generating compact, powerful models in the semi-supervised learning setting where a limited amount of labeled data is available. In large-scale applications, however, the teacher tends to provide a large number of incorrect soft-labels that impair student performance. The sheer size of the teacher additionally constrains the number of soft-labels that can be queried due to prohibitive…
1 Citation

Understanding Self-Distillation in the Presence of Label Noise

Self-distillation is theoretically characterized in two supervised learning problems with noisy labels and it is shown that in the high label noise regime, the optimal value of ξ that minimizes the expected error in estimating the ground truth parameter is surprisingly greater than 1.



Batch Active Learning at Scale

This work analyzes an efficient active learning algorithm that focuses on the large batch setting, and shows that its sampling method easily scales to batch sizes several orders of magnitude larger than those used in previous studies and provides significant improvements in model training efficiency compared to recent baselines.

Knowledge Distillation: A Survey

A comprehensive survey of knowledge distillation from the perspectives of knowledge categories, training schemes, teacher–student architecture, distillation algorithms, performance comparison and applications is provided.

Does Knowledge Distillation Really Work?

It is shown how the details of the dataset used for distillation play a role in how closely the student matches the teacher — and that more closely matching the teacher paradoxically does not always lead to better student generalization.

Certainty driven consistency loss on multi-teacher networks for semi-supervised learning

QActor: Active Learning on Noisy Labels

This work proposes a noise-aware active learning framework, QActor, and a novel measure, CENT, which considers both cross-entropy and entropy to select informative and noisy labels for expert cleansing. The approach can nearly match the optimal accuracy achieved using only clean data at the cost of only an additional 10% of ground truth data from the oracle.

Knowledge distillation: A good teacher is patient and consistent

It is demonstrated that, when performed correctly, knowledge distillation can be a powerful tool for reducing the size of large models without compromising their performance.

Distilling Effective Supervision From Severe Label Noise

This paper presents a holistic framework to train deep neural networks in a way that is highly robust to label noise and achieves excellent performance on large-scale datasets with real-world label noise.

On the Efficacy of Knowledge Distillation

Crucially, it is found that larger models do not often make better teachers, and that small students are unable to mimic large teachers.

Distilling the Knowledge in a Neural Network

This work shows that distilling the knowledge in an ensemble of models into a single model can significantly improve the acoustic model of a heavily used commercial system. It also introduces a new type of ensemble composed of one or more full models and many specialist models, which learn to distinguish fine-grained classes that the full models confuse.