Corpus ID: 214468602

Role-Wise Data Augmentation for Knowledge Distillation

@article{Fu2020RoleWiseDA,
  title={Role-Wise Data Augmentation for Knowledge Distillation},
  author={Jie Fu and Xue Geng and Zhijian Duan and Bohan Zhuang and Xingdi Yuan and Adam Trischler and Jie Lin and Christopher Joseph Pal and Hao Dong},
  journal={ArXiv},
  year={2020},
  volume={abs/2004.08861}
}
Knowledge Distillation (KD) is a common method for transferring the "knowledge" learned by one machine learning model (the teacher) into another model (the student), where typically the teacher has greater capacity (e.g., more parameters or higher bit-widths). To our knowledge, existing methods overlook the fact that although the student absorbs extra knowledge from the teacher, both models share the same input data, and this data is the only medium by which the teacher…
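Since the abstract is truncated, the following is only a minimal sketch of the standard teacher-student KD objective that the paper builds on (temperature-softened soft targets plus hard-label cross-entropy), not of the paper's role-wise augmentation scheme. The names kd_loss, T, and alpha, and the choice of PyTorch, are illustrative assumptions.

import torch
import torch.nn.functional as F

def kd_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.9):
    # Soft targets: KL between temperature-scaled distributions of the two models.
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=1),
        F.softmax(teacher_logits / T, dim=1),
        reduction="batchmean",
    ) * (T * T)  # rescale so gradients are comparable to the hard-label term
    # Hard targets: ordinary cross-entropy on the ground-truth labels.
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1.0 - alpha) * hard

# Usage: logits from both models on the *same* input batch.
student_logits = torch.randn(8, 10, requires_grad=True)
teacher_logits = torch.randn(8, 10)
labels = torch.randint(0, 10, (8,))
kd_loss(student_logits, teacher_logits, labels).backward()
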

Citations

Universal-KD: Attention-based Output-Grounded Intermediate Layer Knowledge Distillation
TLDR
The proposed Universal-KD matches intermediate layers of the teacher and the student in the output space (by adding pseudo classifiers on intermediate layers) via an attention-based layer projection; it can be flexibly combined with current intermediate-layer distillation techniques to improve their results, and the pseudo classifiers can be deployed instead of extra, expensive teacher assistant networks to address the capacity-gap problem in KD.
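A rough sketch of the pseudo-classifier idea summarized above: intermediate features of teacher and student are mapped to class logits by small pseudo classifiers, so the layers can be matched in the output space with a KL term. The attention-based layer projection is omitted, and all names, dimensions, and the temperature are illustrative assumptions.

import torch
import torch.nn as nn
import torch.nn.functional as F

class PseudoClassifier(nn.Module):
    def __init__(self, feat_dim, num_classes):
        super().__init__()
        self.head = nn.Linear(feat_dim, num_classes)

    def forward(self, feats):      # feats: [B, feat_dim] pooled layer features
        return self.head(feats)    # pseudo logits for this layer

def intermediate_output_kd(student_feats, teacher_feats, s_heads, t_heads, T=2.0):
    # Match each student intermediate layer to a paired teacher layer in output space.
    loss = 0.0
    for sf, tf, sh, th in zip(student_feats, teacher_feats, s_heads, t_heads):
        s_log = F.log_softmax(sh(sf) / T, dim=1)
        t_prob = F.softmax(th(tf).detach() / T, dim=1)
        loss = loss + F.kl_div(s_log, t_prob, reduction="batchmean") * T * T
    return loss / len(student_feats)

# Usage with dummy pooled features from 3 paired layers, 10 classes.
s_feats = [torch.randn(8, 256) for _ in range(3)]
t_feats = [torch.randn(8, 768) for _ in range(3)]
s_heads = nn.ModuleList([PseudoClassifier(256, 10) for _ in range(3)])
t_heads = nn.ModuleList([PseudoClassifier(768, 10) for _ in range(3)])
print(intermediate_output_kd(s_feats, t_feats, s_heads, t_heads))
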
RAIL-KD: RAndom Intermediate Layer Mapping for Knowledge Distillation
TLDR
The proposed RAIL-KD approach considerably outperforms other state-of-the-art intermediate-layer KD methods in both performance and training time, and acts as a regularizer that improves the generalizability of the student model.
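A rough sketch of the random intermediate-layer mapping idea described above, under the assumption of pooled per-layer features, a linear projection, and an MSE matching loss; none of these specifics are claimed to be the paper's exact recipe.

import random
import torch
import torch.nn as nn

def rail_kd_loss(student_feats, teacher_feats, projections):
    # student_feats: list of [B, d_s] pooled features from the student's layers.
    # teacher_feats: list of [B, d_t] pooled features from the teacher's layers.
    # projections: one linear map d_s -> d_t per student layer.
    # Randomly choose a distinct teacher layer per student layer (re-drawn every call).
    chosen = sorted(random.sample(range(len(teacher_feats)), k=len(student_feats)))
    loss = 0.0
    for i, j in enumerate(chosen):
        loss = loss + nn.functional.mse_loss(projections[i](student_feats[i]),
                                             teacher_feats[j].detach())
    return loss / len(student_feats)

# Usage with dummy features: 4-layer student, 12-layer teacher.
student_feats = [torch.randn(8, 256) for _ in range(4)]
teacher_feats = [torch.randn(8, 768) for _ in range(12)]
projections = nn.ModuleList([nn.Linear(256, 768) for _ in range(4)])
print(rail_kd_loss(student_feats, teacher_feats, projections))
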
Beyond Self-Supervision: A Simple Yet Effective Network Distillation Alternative to Improve Backbones
TLDR
This paper proposes to improve existing baseline networks via knowledge distillation from off-the-shelf, pre-trained, powerful models by only driving the student model's predictions to be consistent with the teacher model's, and finds that this simple distillation setting is extremely effective.
Channel-wise Knowledge Distillation for Dense Prediction*
TLDR
This work proposes to normalize the activation map of each channel to obtain a soft probability map, and demonstrates that the proposed method considerably outperforms state-of-the-art distillation methods while requiring less computational cost during training.
RW-KD: Sample-wise Loss Terms Re-Weighting for Knowledge Distillation
TLDR
This work proposes a novel sample-wise loss weighting method, RW-KD, in which a meta-learner, trained simultaneously with the student, adaptively re-weights the two KD loss terms for each sample.
Not Far Away, Not So Close: Sample Efficient Nearest Neighbour Data Augmentation via MiniMax
TLDR
This work introduces MiniMax-kNN, a sample efficient data augmentation strategy tailored for Knowledge Distillation (KD), and exploits a semi-supervised approach based on KD to train a model on augmented data.
Channel-wise Distillation for Semantic Segmentation
TLDR
This paper proposes to align the student's and teacher's feature maps channel-wise by first transforming each channel's feature map into a distribution over spatial locations using softmax normalization, and then minimizing the Kullback-Leibler divergence between the corresponding channels of the two networks.
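A minimal sketch of the channel-wise alignment described above, assuming PyTorch feature maps of shape [B, C, H, W]; the temperature tau and the function name are illustrative, not taken from the paper.

import torch
import torch.nn.functional as F

def channel_wise_kd(student_map, teacher_map, tau=4.0):
    # student_map, teacher_map: feature maps of shape [B, C, H, W].
    B, C, H, W = student_map.shape
    # Each channel's H*W activations become a distribution over spatial locations.
    s = F.log_softmax(student_map.view(B, C, H * W) / tau, dim=-1)
    t = F.softmax(teacher_map.view(B, C, H * W) / tau, dim=-1)
    # KL per channel, averaged over batch and channels.
    return F.kl_div(s, t, reduction="batchmean") * (tau ** 2) / C

# Usage with dummy dense-prediction feature maps (e.g., 19 segmentation classes).
student_map = torch.randn(2, 19, 64, 64, requires_grad=True)
teacher_map = torch.randn(2, 19, 64, 64)
channel_wise_kd(student_map, teacher_map).backward()
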
Knowledge Distillation with Noisy Labels for Natural Language Understanding
TLDR
This is the first study of KD with noisy labels in Natural Language Understanding (NLU); the scope of the problem is documented, and two methods to mitigate the impact of label noise are presented.
Supplementary Materials: Channel-wise Knowledge Distillation for Dense Prediction
TLDR
To further demonstrate the effectiveness of the proposed channel distribution distillation, CD is applied to the feature maps used for the authors' final results on Pascal VOC and ADE20K, showing that CD works better than other structural knowledge distillation methods.
Efficient Semantic Segmentation via Self-Attention and Self-Distillation
TLDR
This work proposes a tailored approach to efficient semantic segmentation by leveraging two complementary distillation schemes for supplementing context information to small networks: a self-attention distillation scheme, which transfers long-range context knowledge adaptively from large teacher networks to small student networks, and a layer-wise context distillation scheme.
...
...

References

SHOWING 1-10 OF 47 REFERENCES
Born Again Neural Networks
TLDR
This work studies KD from a new perspective: rather than compressing models, students are parameterized identically to their teachers; it shows significant advantages from transferring knowledge between DenseNets and ResNets in either direction.
Relational Knowledge Distillation
TLDR
RKD proposes distance-wise and angle-wise distillation losses that penalize structural differences in relations; it allows students to outperform their teachers, achieving state-of-the-art results on standard benchmark datasets.
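A minimal sketch of the distance-wise part of the relational loss described above, assuming PyTorch embeddings; the angle-wise term is omitted, and the mean normalization and Huber (smooth L1) matching loss are assumptions for illustration.

import torch
import torch.nn.functional as F

def pairwise_distances(x):
    # x: [B, D] embeddings -> [B, B] Euclidean distance matrix.
    return torch.cdist(x, x, p=2)

def rkd_distance_loss(student_emb, teacher_emb):
    d_s = pairwise_distances(student_emb)
    d_t = pairwise_distances(teacher_emb)
    # Normalize by the mean of the non-zero (off-diagonal) distances so the
    # two models' relational structures are compared at a common scale.
    d_s = d_s / (d_s[d_s > 0].mean() + 1e-8)
    d_t = d_t / (d_t[d_t > 0].mean() + 1e-8)
    return F.smooth_l1_loss(d_s, d_t)

# Usage with dummy embeddings (the teacher is frozen, so detach it in practice).
student_emb = torch.randn(16, 128, requires_grad=True)
teacher_emb = torch.randn(16, 512)
rkd_distance_loss(student_emb, teacher_emb).backward()
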
Improved Knowledge Distillation via Teacher Assistant: Bridging the Gap Between Student and Teacher
TLDR
Multi-step knowledge distillation is introduced, which employs an intermediate-sized network (a.k.a. a teacher assistant) to bridge the gap between the student and the teacher and to alleviate the performance drop that occurs when a fixed student network is distilled from a much larger teacher.
Graph-based Knowledge Distillation by Multi-head Attention Network
TLDR
This paper proposes a novel method that distills dataset-based knowledge from the teacher network (TN) using an attention network: the knowledge of the TN's embedding procedure is distilled into a graph via multi-head attention (MHA), and multi-task learning is performed to give a relational inductive bias to the student network (SN).
Improved Knowledge Distillation via Teacher Assistant
TLDR
Multi-step knowledge distillation is introduced, which employs an intermediate-sized network (a teacher assistant) to bridge the gap between the student and the teacher; the effect of the teacher assistant's size is studied, and the framework is extended to multi-step distillation.
Deep Mutual Learning
TLDR
Surprisingly, it is revealed that no prior powerful teacher network is necessary: mutual learning of a collection of simple student networks works, and moreover outperforms distillation from a more powerful yet static teacher.
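A minimal sketch of mutual learning for a two-network cohort as summarized above: each network is trained on its own cross-entropy plus a KL term pulling it toward its peer's predictions, and both are updated (there is no fixed teacher). The names, the two-network cohort size, and the detached treatment of the peer's predictions are assumptions.

import torch
import torch.nn.functional as F

def mutual_learning_losses(logits_a, logits_b, labels):
    # Supervised cross-entropy for each network.
    ce_a = F.cross_entropy(logits_a, labels)
    ce_b = F.cross_entropy(logits_b, labels)
    # Each network also mimics its peer; the peer's predictions are treated
    # as fixed targets (detached) inside its KL term.
    kl_a = F.kl_div(F.log_softmax(logits_a, dim=1),
                    F.softmax(logits_b.detach(), dim=1), reduction="batchmean")
    kl_b = F.kl_div(F.log_softmax(logits_b, dim=1),
                    F.softmax(logits_a.detach(), dim=1), reduction="batchmean")
    return ce_a + kl_a, ce_b + kl_b  # one loss per network, stepped by its own optimizer

# Usage with dummy logits for a 10-class problem.
logits_a = torch.randn(8, 10, requires_grad=True)
logits_b = torch.randn(8, 10, requires_grad=True)
labels = torch.randint(0, 10, (8,))
loss_a, loss_b = mutual_learning_losses(logits_a, logits_b, labels)
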
Self-supervised Knowledge Distillation Using Singular Value Decomposition
TLDR
A new knowledge distillation method using singular value decomposition (SVD) is proposed; the resulting student DNN (S-DNN) outperforms the S-DNN trained with state-of-the-art distillation, with a performance advantage of 1.79%.
FitNets: Hints for Thin Deep Nets
TLDR
This paper extends the idea of a student network that could imitate the soft output of a larger teacher network or ensemble of networks, using not only the outputs but also the intermediate representations learned by the teacher as hints to improve the training process and final performance of the student.
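A minimal sketch of the hint-based training described above: a learned regressor maps the (thinner) student's intermediate "hint" layer to the dimensionality of the teacher's guided layer, and the two are matched with an L2 loss. The 1x1-convolution regressor and all names are assumptions for convolutional features.

import torch
import torch.nn as nn
import torch.nn.functional as F

class HintLoss(nn.Module):
    def __init__(self, student_channels, teacher_channels):
        super().__init__()
        # Regressor so the student hint matches the teacher's channel count.
        self.regressor = nn.Conv2d(student_channels, teacher_channels, kernel_size=1)

    def forward(self, student_hint, teacher_guided):
        # student_hint: [B, Cs, H, W]; teacher_guided: [B, Ct, H, W]
        return F.mse_loss(self.regressor(student_hint), teacher_guided.detach())

# Usage with dummy intermediate feature maps.
hint = HintLoss(student_channels=64, teacher_channels=256)
loss = hint(torch.randn(4, 64, 32, 32), torch.randn(4, 256, 32, 32))
loss.backward()
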
Learning to Compose Domain-Specific Transformations for Data Augmentation
TLDR
The proposed method can make use of arbitrary, non-deterministic transformation functions, is robust to misspecified user input, and is trained on unlabeled data; it can be used to perform data augmentation for any end discriminative model.
Distilling the Knowledge in a Neural Network
TLDR
This work shows that the acoustic model of a heavily used commercial system can be significantly improved by distilling the knowledge in an ensemble of models into a single model, and introduces a new type of ensemble composed of one or more full models and many specialist models that learn to distinguish fine-grained classes that the full models confuse.
...
...