A Gift from Knowledge Distillation: Fast Optimization, Network Minimization and Transfer Learning

Junho Yim, Donggyu Joo, Ji-Hoon Bae, Junmo Kim
2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), published 21 July 2017
We introduce a novel technique for knowledge transfer, where knowledge from a pretrained deep neural network (DNN) is distilled and transferred to another DNN. As the DNN performs a mapping from the input space to the output space through many layers sequentially, we define the distilled knowledge to be transferred in terms of flow between layers, which is calculated by computing the inner product between features from two layers. When we compare the student DNN and the original network with… 
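The abstract defines the distilled knowledge as the flow between two layers, computed as an inner product between their features. A minimal sketch of that flow matrix and a matching loss, assuming channels-last feature maps of equal spatial size and mean normalization (the shapes and helper names here are illustrative, not the authors' code):

```python
import numpy as np

def flow_matrix(f1, f2):
    """Flow between two layers: entry (i, j) is the spatially averaged
    inner product between channel i of the first feature map and
    channel j of the second. f1: (h, w, c1), f2: (h, w, c2)."""
    h, w, c1 = f1.shape
    c2 = f2.shape[2]
    a = f1.reshape(h * w, c1)
    b = f2.reshape(h * w, c2)
    return a.T @ b / (h * w)          # (c1, c2) flow matrix

def flow_loss(teacher_pair, student_pair):
    """Squared distance between teacher and student flow matrices,
    the quantity a student would be trained to minimize."""
    gt = flow_matrix(*teacher_pair)
    gs = flow_matrix(*student_pair)
    return float(np.mean((gt - gs) ** 2))
```

The matrix depends only on channel counts, not spatial size, which is what lets knowledge transfer between networks whose layers differ in resolution.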


Variational Information Distillation for Knowledge Transfer

An information-theoretic framework for knowledge transfer is proposed which formulates knowledge transfer as maximizing the mutual information between the teacher and the student networks and which consistently outperforms existing methods.

Oracle Teacher: Towards Better Knowledge Distillation

This work introduces a new type of teacher model for KD, namely Oracle Teacher, that utilizes the embeddings of both the source inputs and the output labels to extract a more accurate knowledge to be transferred to the student.

Self-supervised Knowledge Distillation Using Singular Value Decomposition

A new knowledge distillation method using singular value decomposition (SVD) is proposed; the resulting student DNN outperforms one trained with the prior state-of-the-art distillation method by 1.79%.
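The core operation such SVD-based methods rely on is rank truncation of a feature matrix, keeping only its dominant singular components. A rough sketch (the function name and the choice of `k` are illustrative, not from the cited paper):

```python
import numpy as np

def svd_compress(feat, k):
    """Rank-k approximation of a flattened feature matrix: keep the
    top-k singular components, discarding the rest as noise."""
    u, s, vt = np.linalg.svd(feat, full_matrices=False)
    return (u[:, :k] * s[:k]) @ vt[:k]   # rank-k reconstruction
```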

Local Region Knowledge Distillation

Local linear region knowledge distillation (LRKD) is proposed, which transfers knowledge in local, linear regions from a teacher to a student and forces the student to mimic the local shape of the teacher function in those regions.

Knowledge Transfer via Dense Cross-Layer Mutual-Distillation

This paper proposes Dense Cross-layer Mutual-distillation (DCM), an improved two-way KT method in which the teacher and student networks are trained collaboratively from scratch, and introduces dense bidirectional KD operations between the layers appended with classifiers.

A Two-Teacher Framework for Knowledge Distillation

This work proposes a novel framework that consists of two teacher networks trained with different strategies, one is trained strictly to guide the student network to learn sophisticated features, and the other is trained loosely to guide it to learn general decision based on learned features.

Distillating Knowledge from Graph Convolutional Networks

This paper proposes a local structure preserving module that explicitly accounts for the topological semantics of the teacher GCN, and achieves the state-of-the-art knowledge distillation performance for GCN models.

QUEST: Quantized embedding space for transferring knowledge

This work proposes a novel way to achieve knowledge distillation: by distilling the knowledge through a quantized space, where the teacher's feature maps are quantized to represent the main visual concepts encompassed in the feature maps.

Generalized Knowledge Distillation via Relationship Matching

The knowledge of a well-trained deep neural network (a.k.a. the teacher) is valuable for learning similar tasks. Knowledge distillation extracts knowledge from the teacher and integrates it with the

Knowledge Distillation for Optimization of Quantized Deep Neural Networks

The experiments show that even a small teacher model can achieve the same distillation performance as a large teacher model; the work also proposes the gradual soft loss reduction (GSLR) technique, which controls the mixing ratio of hard and soft losses during training for robust KD-based QDNN optimization.



Net2Net: Accelerating Learning via Knowledge Transfer

The Net2Net technique accelerates the experimentation process by instantaneously transferring the knowledge from a previous network to each new deeper or wider network, and demonstrates a new state-of-the-art accuracy on the ImageNet dataset.

Distilling the Knowledge in a Neural Network

This work shows that it can significantly improve the acoustic model of a heavily used commercial system by distilling the knowledge in an ensemble of models into a single model and introduces a new type of ensemble composed of one or more full models and many specialist models which learn to distinguish fine-grained classes that the full models confuse.
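The distillation objective this paper introduced is a weighted sum of cross-entropy against the teacher's temperature-softened predictions and against the hard labels. A minimal NumPy sketch; the values of `T` and `alpha` are illustrative settings, not prescriptions from the paper:

```python
import numpy as np

def softmax(z, T=1.0):
    """Temperature-scaled softmax; higher T gives softer distributions."""
    z = z / T
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def kd_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.9):
    """Soft-target cross-entropy (scaled by T^2, as in the paper, so its
    gradient magnitude is comparable) blended with hard-label loss."""
    p_t = softmax(teacher_logits, T)
    log_p_s_T = np.log(softmax(student_logits, T) + 1e-12)
    soft = -np.sum(p_t * log_p_s_T, axis=-1).mean() * (T ** 2)
    log_p_s = np.log(softmax(student_logits) + 1e-12)
    hard = -log_p_s[np.arange(len(labels)), labels].mean()
    return alpha * soft + (1 - alpha) * hard
```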

Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift

Applied to a state-of-the-art image classification model, Batch Normalization achieves the same accuracy with 14 times fewer training steps, and beats the original model by a significant margin.
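The mechanism behind Batch Normalization is simple to state: normalize each feature over the mini-batch, then restore expressiveness with a learned scale and shift. A training-mode forward-pass sketch (inference would use running statistics instead):

```python
import numpy as np

def batch_norm(x, gamma, beta, eps=1e-5):
    """Normalize each feature column of the batch x (n, d) to zero mean
    and unit variance, then apply learned scale gamma and shift beta."""
    mu = x.mean(axis=0)
    var = x.var(axis=0)
    x_hat = (x - mu) / np.sqrt(var + eps)
    return gamma * x_hat + beta
```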

Image Question Answering Using Convolutional Neural Network with Dynamic Parameter Prediction

The proposed network (a joint network combining the CNN for ImageQA with the parameter prediction network) is trained end-to-end through back-propagation, with its weights initialized using a pre-trained CNN and GRU.

All you need is a good init

Performance is evaluated on GoogLeNet, CaffeNet, FitNets and Residual nets and the state-of-the-art, or very close to it, is achieved on the MNIST, CIFAR-10/100 and ImageNet datasets.

Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification

This work proposes a Parametric Rectified Linear Unit (PReLU) that generalizes the traditional rectified unit and derives a robust initialization method that particularly considers the rectifier nonlinearities.
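The PReLU unit itself is a one-line generalization of ReLU: identity for positive inputs, a learned slope for negative ones. A sketch, with `a=0.25` as an illustrative initial value:

```python
import numpy as np

def prelu(x, a=0.25):
    """Parametric ReLU: x for x > 0, a * x otherwise; a is learned
    per channel during training rather than fixed as in leaky ReLU."""
    return np.where(x > 0, x, a * x)
```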

Understanding the difficulty of training deep feedforward neural networks

The objective here is to understand better why standard gradient descent from random initialization is doing so poorly with deep neural networks, to better understand these recent relative successes and help design better algorithms in the future.

Deep Residual Learning for Image Recognition

This work presents a residual learning framework to ease the training of networks that are substantially deeper than those used previously, and provides comprehensive empirical evidence showing that these residual networks are easier to optimize, and can gain accuracy from considerably increased depth.
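The residual framework reformulates a block to learn F(x) and output F(x) + x, so the identity shortcut carries gradients directly. A minimal fully-connected sketch of an identity-shortcut block (the real ResNet blocks use convolutions and batch normalization):

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def residual_block(x, w1, w2):
    """Two stacked layers learn the residual F(x); the input is added
    back through the skip connection before the final nonlinearity."""
    out = relu(x @ w1)    # first layer
    out = out @ w2        # second layer: F(x)
    return relu(out + x)  # skip connection: F(x) + x
```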

ImageNet classification with deep convolutional neural networks

A large, deep convolutional neural network was trained to classify the 1.2 million high-resolution images in the ImageNet LSVRC-2010 contest into the 1000 different classes and employed a recently developed regularization method called "dropout" that proved to be very effective.

Wide Residual Networks

This paper conducts a detailed experimental study of the architecture of ResNet blocks and proposes a novel architecture in which the depth of residual networks is decreased and their width increased; the resulting network structures, called wide residual networks (WRNs), are far superior to their commonly used thin and very deep counterparts.