Natural Statistics Of Network Activations And Implications For Knowledge Distillation

@article{Rotman2021NaturalSO,
  title={Natural Statistics Of Network Activations And Implications For Knowledge Distillation},
  author={Michael Rotman and Lior Wolf},
  journal={2021 IEEE International Conference on Image Processing (ICIP)},
  year={2021},
  pages={399-403}
}
  • Michael Rotman, Lior Wolf
  • Published 1 June 2021
  • Computer Science
  • 2021 IEEE International Conference on Image Processing (ICIP)
In a manner that is analogous to the study of natural image statistics, we study the natural statistics of deep neural network activations at various layers. As we show, these statistics, similar to image statistics, follow a power law. We also show, both analytically and empirically, that with depth the exponent of this power law increases at a linear rate. As a direct implication of our discoveries, we present a method for performing Knowledge Distillation (KD). While classical KD methods… 
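As a rough illustration of the kind of measurement the abstract describes, the sketch below fits a power-law exponent to the radially averaged spatial power spectrum of feature maps at several depths of a ResNet-18. The choice of network and layers, the untrained weights, and the random inputs are placeholders; the paper's actual measurement protocol may differ.

```python
import numpy as np
import torch
import torchvision.models as models

def spectral_exponent(feat):
    """Fit the slope of the radially averaged log power spectrum of feature maps (B, C, H, W)."""
    power = (torch.fft.fft2(feat.float()).abs() ** 2).mean(dim=(0, 1)).cpu().numpy()
    power = np.fft.fftshift(power)
    h, w = power.shape
    yy, xx = np.indices((h, w))
    r = np.sqrt((yy - h // 2) ** 2 + (xx - w // 2) ** 2).astype(int)
    counts = np.bincount(r.ravel())
    radial = np.bincount(r.ravel(), weights=power.ravel()) / np.maximum(counts, 1)
    valid = counts[1:] > 0                      # skip the DC bin and any empty radii
    freqs = np.arange(1, len(radial))[valid]
    slope, _ = np.polyfit(np.log(freqs), np.log(radial[1:][valid] + 1e-12), 1)
    return -slope                               # power-law exponent

# Hypothetical usage: one exponent per residual stage of a ResNet-18.
model = models.resnet18(weights=None).eval()    # in practice: a trained network and real images
x = torch.randn(4, 3, 224, 224)
acts = {}
for name in ["layer1", "layer2", "layer3", "layer4"]:
    getattr(model, name).register_forward_hook(
        lambda m, i, o, name=name: acts.update({name: o}))
with torch.no_grad():
    model(x)
for name, a in acts.items():
    print(name, round(spectral_exponent(a), 2))
```

Under the abstract's claim, the printed exponents would be expected to grow roughly linearly from layer1 to layer4 when a trained network is fed natural images.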

References

Showing 1-10 of 24 references

How transferable are features in deep neural networks?

This paper quantifies the generality versus specificity of neurons in each layer of a deep convolutional neural network and reports a few surprising results, including that initializing a network with transferred features from almost any number of layers can produce a boost to generalization that lingers even after fine-tuning to the target dataset.
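A minimal sketch of the transfer experiment this summary describes: copy features trained on a source task into a target network, optionally freeze the first few stages, and fine-tune the rest. The cut point, the 10-class head, and the learning rate are illustrative.

```python
import torch
import torchvision.models as models

# Transfer features from a source-trained network into a target network,
# optionally freezing the early stages to probe how general they are.
source = models.resnet18(weights=None)          # stands in for a network trained on the source task
target = models.resnet18(weights=None, num_classes=10)

# copy everything except the classification head (its shape differs)
src_state = {k: v for k, v in source.state_dict().items() if not k.startswith("fc.")}
target.load_state_dict(src_state, strict=False)

# freeze the transferred early stages; later stages and the head are fine-tuned
for module in (target.conv1, target.bn1, target.layer1, target.layer2):
    for p in module.parameters():
        p.requires_grad = False

optimizer = torch.optim.SGD(
    [p for p in target.parameters() if p.requires_grad], lr=0.01, momentum=0.9)
```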

FitNets: Hints for Thin Deep Nets

This paper extends the idea of a student network that could imitate the soft output of a larger teacher network or ensemble of networks, using not only the outputs but also the intermediate representations learned by the teacher as hints to improve the training process and final performance of the student.
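A hedged sketch of a FitNets-style hint loss: a 1x1 convolutional regressor maps the thinner student's intermediate features to the teacher's width, and the L2 distance between the two is minimized. The channel counts and feature shapes are placeholders.

```python
import torch
import torch.nn as nn

# A 1x1 regressor adapts the (thinner) student hint layer to the teacher's width;
# both the regressor and the student are trained to minimise the hint loss.
regressor = nn.Conv2d(64, 256, kernel_size=1)   # 64 / 256 channels are placeholders

def hint_loss(student_feat, teacher_feat):
    """L2 distance between the adapted student hint and the (frozen) teacher hint."""
    return ((regressor(student_feat) - teacher_feat.detach()) ** 2).mean()

s = torch.randn(8, 64, 32, 32)     # intermediate student activations
t = torch.randn(8, 256, 32, 32)    # intermediate teacher activations (same spatial size assumed)
loss = hint_loss(s, t)
```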

Deep Pyramidal Residual Networks

This work gradually increases the feature map dimension at all units so as to involve as many locations as possible in the network architecture, and proposes a novel residual unit capable of further improving the classification accuracy with the new architecture.
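The widening schedule can be sketched as a simple width-per-unit computation; the values of alpha, the base width, and the number of units below are illustrative, not those of the paper.

```python
def pyramidal_widths(n_units, alpha=48, base=16):
    """Channels grow linearly by alpha / n_units at every residual unit,
    instead of doubling only at stage boundaries."""
    return [round(base + alpha * k / n_units) for k in range(1, n_units + 1)]

print(pyramidal_widths(9))   # one output width per residual unit, e.g. [21, 27, 32, ...]
```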

Distilling the Knowledge in a Neural Network

This work shows that it can significantly improve the acoustic model of a heavily used commercial system by distilling the knowledge in an ensemble of models into a single model and introduces a new type of ensemble composed of one or more full models and many specialist models which learn to distinguish fine-grained classes that the full models confuse.
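A compact sketch of the distillation objective described here: soften the teacher's and student's logits with a temperature and mix the resulting KL term with the usual cross-entropy on hard labels. The temperature and mixing weight are illustrative hyperparameters.

```python
import torch
import torch.nn.functional as F

def kd_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.9):
    """Blend a softened KL term against the teacher with the usual
    cross-entropy on the hard labels; T and alpha are illustrative."""
    soft = F.kl_div(F.log_softmax(student_logits / T, dim=1),
                    F.softmax(teacher_logits / T, dim=1),
                    reduction="batchmean") * (T * T)
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard

s = torch.randn(8, 10)              # student logits
t = torch.randn(8, 10)              # teacher logits
y = torch.randint(0, 10, (8,))      # ground-truth labels
loss = kd_loss(s, t, y)
```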

Wide Residual Networks

This paper conducts a detailed experimental study on the architecture of ResNet blocks and proposes a novel architecture in which the depth of residual networks is decreased and their width increased; the resulting network structures, called wide residual networks (WRNs), are far superior to their commonly used thin and very deep counterparts.
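A sketch of a single wide residual block, where the widening factor k multiplies the channel count; the pre-activation BN-ReLU-conv ordering and the specific widths are illustrative.

```python
import torch
import torch.nn as nn

class WideBasic(nn.Module):
    """Pre-activation basic block; width is controlled by the caller."""
    def __init__(self, in_ch, out_ch, stride=1):
        super().__init__()
        self.bn1 = nn.BatchNorm2d(in_ch)
        self.conv1 = nn.Conv2d(in_ch, out_ch, 3, stride=stride, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(out_ch)
        self.conv2 = nn.Conv2d(out_ch, out_ch, 3, padding=1, bias=False)
        self.shortcut = (nn.Conv2d(in_ch, out_ch, 1, stride=stride, bias=False)
                         if stride != 1 or in_ch != out_ch else nn.Identity())

    def forward(self, x):
        out = self.conv1(torch.relu(self.bn1(x)))
        out = self.conv2(torch.relu(self.bn2(out)))
        return out + self.shortcut(x)

k = 8                                   # widening factor (illustrative)
block = WideBasic(16, 16 * k)           # a block in the first group of a WRN
y = block(torch.randn(2, 16, 32, 32))
```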

Knowledge Transfer via Distillation of Activation Boundaries Formed by Hidden Neurons

This paper proposes a knowledge transfer method based on the distillation of activation boundaries formed by hidden neurons, together with an activation transfer loss that attains its minimum when the boundaries generated by the student coincide with those of the teacher.
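A simplified, hinge-style sketch of the idea (not the paper's exact loss): the student's pre-activations are pushed to the same side of zero as the teacher's, with a margin.

```python
import torch
import torch.nn.functional as F

def activation_boundary_loss(student_pre, teacher_pre, margin=1.0):
    """Encourage student pre-activations to fall on the same side of zero
    as the teacher's, with a hinge margin (simplified; not the paper's exact loss)."""
    active = (teacher_pre > 0).float()                  # teacher's activation boundary
    loss = active * F.relu(margin - student_pre) ** 2 \
         + (1 - active) * F.relu(margin + student_pre) ** 2
    return loss.mean()

s = torch.randn(8, 128)     # student pre-activations (before the ReLU)
t = torch.randn(8, 128)     # teacher pre-activations
print(activation_boundary_loss(s, t))
```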

Deep Image Prior

It is shown that a randomly-initialized neural network can be used as a handcrafted prior with excellent results in standard inverse problems such as denoising, super-resolution, and inpainting.
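A minimal deep-image-prior style loop, assuming a tiny convolutional network and a fixed random input; the original work uses a larger encoder-decoder, and the iteration count here is only a placeholder for early stopping.

```python
import torch
import torch.nn as nn

# Fit a randomly initialised network to a single noisy image from a fixed noise code;
# early stopping acts as the regulariser.
noisy = torch.rand(1, 3, 64, 64)        # stand-in for the observed noisy image
z = torch.randn(1, 32, 64, 64)          # fixed random input

net = nn.Sequential(
    nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(),
    nn.Conv2d(64, 64, 3, padding=1), nn.ReLU(),
    nn.Conv2d(64, 3, 3, padding=1), nn.Sigmoid(),
)
opt = torch.optim.Adam(net.parameters(), lr=1e-3)

for step in range(200):                 # stop well before the network fits the noise
    opt.zero_grad()
    loss = ((net(z) - noisy) ** 2).mean()
    loss.backward()
    opt.step()

restored = net(z).detach()              # the network output is the restored estimate
```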

Learning Multiple Layers of Features from Tiny Images

It is shown how to train a multi-layer generative model that learns to extract meaningful features which resemble those found in the human visual cortex, using a novel parallelization algorithm to distribute the work among multiple machines connected on a network.

Paying More Attention to Attention: Improving the Performance of Convolutional Neural Networks via Attention Transfer

This work shows that, by properly defining attention for convolutional neural networks, this type of information can be used in order to significantly improve the performance of a student CNN network by forcing it to mimic the attention maps of a powerful teacher network.
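One common way to write an attention-transfer loss is sketched below: spatial attention maps are formed by aggregating squared activations over channels, L2-normalized, and matched between student and teacher. The aggregation (mean over channels) and the tensor shapes are illustrative.

```python
import torch
import torch.nn.functional as F

def attention_map(feat):
    """Spatial attention map: aggregate squared activations over channels, then L2-normalise."""
    return F.normalize(feat.pow(2).mean(dim=1).flatten(1), dim=1)   # (B, H*W)

def at_loss(student_feat, teacher_feat):
    """Distance between student and teacher attention maps (same spatial size assumed)."""
    return (attention_map(student_feat) - attention_map(teacher_feat)).pow(2).mean()

s = torch.randn(8, 64, 32, 32)
t = torch.randn(8, 256, 32, 32)   # channel counts may differ; spatial sizes must match
print(at_loss(s, t))
```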

Deep Residual Learning for Image Recognition

This work presents a residual learning framework to ease the training of networks that are substantially deeper than those used previously, and provides comprehensive empirical evidence showing that these residual networks are easier to optimize, and can gain accuracy from considerably increased depth.
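A minimal residual block in the spirit of this summary: the stacked layers learn a residual that is added to an identity shortcut. The channel count is illustrative.

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """The stacked layers learn a residual F(x); the output is F(x) + x."""
    def __init__(self, ch):
        super().__init__()
        self.conv1 = nn.Conv2d(ch, ch, 3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(ch)
        self.conv2 = nn.Conv2d(ch, ch, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(ch)

    def forward(self, x):
        out = torch.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return torch.relu(out + x)          # identity shortcut eases optimisation

y = ResidualBlock(64)(torch.randn(2, 64, 56, 56))
```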