• Corpus ID: 7200347

Distilling the Knowledge in a Neural Network

Geoffrey E. Hinton, Oriol Vinyals, and Jeffrey Dean
A very simple way to improve the performance of almost any machine learning algorithm is to train many different models on the same data and then to average their predictions. Caruana and his collaborators have shown that it is possible to compress the knowledge in an ensemble into a single model which is much easier to deploy, and we develop this approach further using a different compression technique. We achieve some surprising results on MNIST and we show that we can significantly improve…
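The compression technique the abstract alludes to, distillation, averages the teachers' outputs and softens them with a temperature before the student is trained to match them. A minimal sketch in plain Python (the function names and the temperature value are illustrative assumptions, not from the paper):

```python
import math

def softmax_with_temperature(logits, T=1.0):
    """Softmax over logits / T; higher T yields a softer (more uniform) distribution."""
    scaled = [z / T for z in logits]
    m = max(scaled)                          # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def ensemble_soft_targets(per_model_logits, T=2.0):
    """Average the logits of several teacher models, then soften with temperature T."""
    n_models = len(per_model_logits)
    n_classes = len(per_model_logits[0])
    avg = [sum(m[c] for m in per_model_logits) / n_models for c in range(n_classes)]
    return softmax_with_temperature(avg, T)

# Two hypothetical 3-class teachers; the student would be trained on these soft targets.
soft = ensemble_soft_targets([[2.0, 1.0, 0.1], [1.8, 1.2, 0.0]], T=2.0)
```

Raising `T` flattens the distribution, exposing the relative probabilities the teachers assign to the wrong classes, which is the "dark knowledge" the student learns from.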


Distilling Knowledge from Ensembles of Neural Networks for Speech Recognition
This paper describes how to use knowledge distillation to combine acoustic models in a way that improves recognition accuracy significantly, can be implemented with standard training tools, and requires no additional complexity during recognition.
Distilling Model Knowledge
This thesis presents a general framework for knowledge distillation, whereby a convenient model learns to mimic a complex model by observing the latter's behaviour and being penalized whenever it fails to reproduce it.
Essence Knowledge Distillation for Speech Recognition
This paper proposes to distill the essential knowledge in an ensemble of models into a single model that needs much less computation to deploy, and trains the student model with a multitask learning approach that utilizes both the softened outputs of the teacher model and the correct hard labels.
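The multitask objective described above, cross-entropy against the teacher's temperature-softened distribution plus cross-entropy against the hard label, can be sketched as follows (the weighting `alpha`, temperature `T`, and function names are assumptions for illustration):

```python
import math

def softmax(logits, T=1.0):
    """Softmax over logits / T, computed stably."""
    scaled = [z / T for z in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(student_logits, teacher_logits, hard_label, T=2.0, alpha=0.5):
    """Weighted sum of (a) cross-entropy with the hard label and
    (b) cross-entropy against the teacher's temperature-softened distribution,
    scaled by T**2 so the soft-target gradients keep a comparable magnitude."""
    p_student_T = softmax(student_logits, T)
    p_teacher_T = softmax(teacher_logits, T)
    soft_ce = -sum(t * math.log(s) for t, s in zip(p_teacher_T, p_student_T))
    p_student = softmax(student_logits, 1.0)
    hard_ce = -math.log(p_student[hard_label])
    return alpha * hard_ce + (1 - alpha) * (T ** 2) * soft_ce

# Hypothetical 3-class example: student logits, teacher logits, true class 0.
loss = distillation_loss([2.0, 1.0, 0.1], [2.5, 0.5, 0.0], hard_label=0)
```

A student whose logits match the teacher's incurs a lower loss than one that disagrees, which is what drives the mimicry during training.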
Distillation of Deep Learning Ensembles as a Regularisation Method
It is shown that an ensemble of deep neural networks can indeed be approximated by a single deep neural network with size and capacity equal to the single ensemble member, and a recipe is developed that shows how this can be achieved without using any artificial training data or any other special provisions.
Model Fusion via Optimal Transport
This work presents a layer-wise model fusion algorithm for neural networks that utilizes optimal transport to (soft-) align neurons across the models before averaging their associated parameters, and shows that this can successfully yield "one-shot" knowledge transfer between neural networks trained on heterogeneous non-i.i.d. data.
Knowledge distillation for small-footprint highway networks
This paper significantly improved the recognition accuracy of the HDNN acoustic model with less than 0.8 million parameters, and narrowed the gap between this model and the plain DNN with 30 million parameters.
Rapid Training of Very Large Ensembles of Diverse Neural Networks
This work captures the structural similarity between members of a neural network ensemble by training only one member; this knowledge is then transferred to all members of the ensemble using function-preserving transformations, so that the ensemble networks converge significantly faster than when trained from scratch.
Towards Understanding Ensemble, Knowledge Distillation and Self-Distillation in Deep Learning
It is proved that self-distillation can also be viewed as implicitly combining ensembling and knowledge distillation to improve test accuracy, shedding light on how ensembling works in deep learning in a way that is completely different from traditional theory.
Efficient Knowledge Distillation from an Ensemble of Teachers
It is shown that with knowledge distillation, information from multiple acoustic models like very deep VGG networks and Long Short-Term Memory models can be used to train standard convolutional neural network (CNN) acoustic models for a variety of systems requiring a quick turnaround.
An Efficient Method of Training Small Models for Regression Problems with Knowledge Distillation
This paper proposes a new loss function, teacher outlier rejection loss, which rejects outliers in training samples using teacher model predictions, and considers a multi-task network with two outputs, which allows for better training of the student model's feature extractor.


Learning small-size DNN with output-distribution-based criteria
This study proposes to utilize the DNN output distribution and to cluster the senones in the large set into a small one by directly relating the clustering process to the DNN parameters, as opposed to decoupling senone generation from DNN training as in the standard procedure.
Improving neural networks by preventing co-adaptation of feature detectors
When a large feedforward neural network is trained on a small training set, it typically performs poorly on held-out test data. This "overfitting" is greatly reduced by randomly omitting half of the feature detectors on each training case.
Large Scale Distributed Deep Networks
This paper considers the problem of training a deep network with billions of parameters using tens of thousands of CPU cores and develops two algorithms for large-scale distributed training, Downpour SGD and Sandblaster L-BFGS, which increase the scale and speed of deep network training.
Model compression
This work presents a method for "compressing" large, complex ensembles into smaller, faster models, usually without significant loss in performance.
Adaptive Mixtures of Local Experts
A new supervised learning procedure for systems composed of many separate networks, each of which learns to handle a subset of the complete set of training cases, so that each subtask can be solved by a very simple expert network.
Deep Neural Networks for Acoustic Modeling in Speech Recognition: The Shared Views of Four Research Groups
This article provides an overview of progress and represents the shared views of four research groups that have had recent successes in using DNNs for acoustic modeling in speech recognition.
Model compression
  • In Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '06, pages 535–541, New York, NY, USA,
  • 2006
Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups
  • Signal Processing Magazine, IEEE,
  • 2012
Dropout: a simple way to prevent neural networks from overfitting
It is shown that dropout improves the performance of neural networks on supervised learning tasks in vision, speech recognition, document classification and computational biology, obtaining state-of-the-art results on many benchmark data sets.
ImageNet classification with deep convolutional neural networks
A large, deep convolutional neural network was trained to classify the 1.2 million high-resolution images in the ImageNet LSVRC-2010 contest into the 1000 different classes and employed a recently developed regularization method called "dropout" that proved to be very effective.