Corpus ID: 15582471

A Deep Neural Network Compression Pipeline: Pruning, Quantization, Huffman Encoding

@inproceedings{Han2015ADN,
  title={A Deep Neural Network Compression Pipeline: Pruning, Quantization, Huffman Encoding},
  author={Song Han and Huizi Mao and William J. Dally},
  year={2015}
}
Neural networks are both computationally intensive and memory intensive, making them difficult to deploy on embedded systems with limited hardware resources. […] Key Method: Our method first prunes the network by learning only the important connections. Next, we quantize the weights to enforce weight sharing; finally, we apply Huffman encoding. After the first two steps we retrain the network to fine-tune the remaining connections and the quantized centroids.
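Below is a minimal sketch of the three stages on a single raw NumPy weight matrix: magnitude pruning, weight sharing via a tiny 1-D k-means, and Huffman coding of the resulting codebook indices. The sparsity level, 16 centroids, linear centroid initialization, and the toy Lloyd loop are illustrative assumptions; the retraining after each stage described in the abstract is omitted.

```python
# Sketch of prune -> weight-share -> Huffman on one weight matrix (assumptions noted above).
import heapq
import numpy as np
from collections import Counter

def prune(weights, sparsity=0.9):
    """Zero out the smallest-magnitude weights (magnitude pruning)."""
    threshold = np.quantile(np.abs(weights), sparsity)
    mask = np.abs(weights) > threshold
    return weights * mask, mask

def kmeans_share(weights, mask, n_centroids=16, iters=20):
    """Cluster surviving weights into n_centroids shared values (1-D k-means)."""
    vals = weights[mask]
    centroids = np.linspace(vals.min(), vals.max(), n_centroids)  # linear init
    for _ in range(iters):
        idx = np.argmin(np.abs(vals[:, None] - centroids[None, :]), axis=1)
        for k in range(n_centroids):
            if np.any(idx == k):
                centroids[k] = vals[idx == k].mean()
    return centroids, idx

def huffman_avg_bits(symbols):
    """Average code length (bits/symbol) of a Huffman code for `symbols`."""
    counts = Counter(symbols)
    heap = [(c, i, {s: ""}) for i, (s, c) in enumerate(counts.items())]
    heapq.heapify(heap)
    uid = len(heap)
    while len(heap) > 1:
        c1, _, code1 = heapq.heappop(heap)
        c2, _, code2 = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in code1.items()}
        merged.update({s: "1" + c for s, c in code2.items()})
        heapq.heappush(heap, (c1 + c2, uid, merged))
        uid += 1
    codebook = heap[0][2]
    total = sum(counts.values())
    return sum(counts[s] * len(codebook[s]) for s in counts) / total

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    W = rng.normal(scale=0.1, size=(256, 256)).astype(np.float32)
    Wp, mask = prune(W, sparsity=0.9)          # stage 1: pruning
    centroids, idx = kmeans_share(Wp, mask)    # stage 2: quantization / weight sharing
    bits = huffman_avg_bits(idx.tolist())      # stage 3: Huffman coding of indices
    print(f"surviving weights: {mask.sum()}, avg bits/index: {bits:.2f}")
```

The Huffman stage here only reports the average code length; a real pipeline would also store the sparse index structure and the codebook itself.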
Deep Neural Network Compression Method Based on Product Quantization
TLDR
A method based on the combination of product quantization and pruning is proposed to compress deep neural networks with large model size and heavy computation, reducing storage overhead so that the deep neural network can be deployed on embedded devices.
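For reference, here is a minimal sketch of product quantization on a fully-connected weight matrix: columns are split into groups, each group is clustered with a small k-means, and every sub-vector is replaced by the index of its nearest codeword. The group size, codebook size, and plain Lloyd iterations are illustrative assumptions; the pruning and fine-tuning the cited method combines with this step are omitted.

```python
# Product quantization sketch (toy k-means; not the cited paper's full training procedure).
import numpy as np

def kmeans(X, k, iters=20, seed=0):
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        d = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        assign = d.argmin(1)
        for j in range(k):
            if np.any(assign == j):
                centers[j] = X[assign == j].mean(0)
    return centers, assign

def product_quantize(W, group=4, k=32):
    """Return per-group codebooks and codes for an (out, in) weight matrix."""
    codebooks, codes = [], []
    for start in range(0, W.shape[1], group):
        sub = W[:, start:start + group]            # (out, group) sub-vectors
        centers, assign = kmeans(sub, k)
        codebooks.append(centers)
        codes.append(assign)
    return codebooks, np.stack(codes, axis=1)      # codes: (out, n_groups)

def reconstruct(codebooks, codes):
    return np.concatenate(
        [codebooks[g][codes[:, g]] for g in range(codes.shape[1])], axis=1)

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    W = rng.normal(size=(512, 64)).astype(np.float32)
    books, codes = product_quantize(W, group=4, k=32)
    W_hat = reconstruct(books, codes)
    print("reconstruction MSE:", float(((W - W_hat) ** 2).mean()))
```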
Compact Deep Convolutional Neural Networks With Coarse Pruning
TLDR
A simple and generic strategy to choose the least adversarial pruning masks for both granularities is proposed, and it is shown that more than 85% sparsity can be induced in the convolution layers with less than 1% increase in the misclassification rate of the baseline network.
Scalpel: Customizing DNN pruning to the underlying hardware parallelism
TLDR
This work implemented weight pruning for several popular networks on a variety of hardware platforms and observed surprising results, including mean speedups of 3.54x, 2.61x, and 1.25x while reducing the model sizes by 88%, 82%, and 53%.
Compression of Deep Convolutional Neural Networks for Fast and Low Power Mobile Applications
TLDR
A simple and effective scheme to compress the entire CNN, called one-shot whole-network compression, which addresses the important implementation-level issue of 1×1 convolution, a key operation in the inception module of GoogLeNet as well as in CNNs compressed by the proposed scheme.
Pruning Filters for Efficient ConvNets
TLDR
This work presents an acceleration method for CNNs, where it is shown that even simple filter pruning techniques can reduce inference costs for VGG-16 and ResNet-110 by up to 38% on CIFAR10 while regaining close to the original accuracy by retraining the networks.
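A rough sketch of the magnitude-based filter pruning this line refers to is shown below: filters of a conv layer are ranked by their L1 norm and the smallest fraction is removed, together with the matching input channels of the following layer. The (out_ch, in_ch, kH, kW) layout, the prune ratio, and the absence of the retraining step the paper relies on are assumptions of this sketch.

```python
# L1-norm filter pruning sketch (illustrative ratio; retraining omitted).
import numpy as np

def prune_filters(conv_w, next_w, ratio=0.3):
    """Drop the `ratio` fraction of filters with the smallest L1 norm."""
    n_out = conv_w.shape[0]
    norms = np.abs(conv_w).reshape(n_out, -1).sum(axis=1)   # L1 norm per filter
    keep = np.sort(np.argsort(norms)[int(ratio * n_out):])  # filter indices to keep
    pruned = conv_w[keep]                     # fewer output filters
    next_pruned = next_w[:, keep]             # matching input channels removed
    return pruned, next_pruned, keep

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    conv1 = rng.normal(size=(64, 3, 3, 3))    # 64 filters over an RGB input
    conv2 = rng.normal(size=(128, 64, 3, 3))  # consumes conv1's 64 channels
    c1, c2, kept = prune_filters(conv1, conv2, ratio=0.3)
    print(c1.shape, c2.shape)                 # (45, 3, 3, 3) (128, 45, 3, 3)
```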
Compressing Convolutional Neural Networks in the Frequency Domain
TLDR
This paper presents a novel network architecture, Frequency-Sensitive Hashed Nets (FreshNets), which exploits inherent redundancy in both convolutional layers and fully-connected layers of a deep learning model, leading to dramatic savings in memory and storage consumption.
Deep Neural Network Approximation using Tensor Sketching
TLDR
This work focuses on deep convolutional neural network architectures, and proposes a novel randomized tensor sketching technique that is utilized to develop a unified framework for approximating the operation of both convolutional and fully-connected layers.
Pruning Deep Convolutional Neural Networks for Fast Inference
TLDR
This dissertation has proposed pruning and fixed-point optimization techniques to reduce the computational complexity of deep neural networks.
SEP-Nets: Small and Effective Pattern Networks
TLDR
This paper proposes a simple yet powerful method for compressing the size of deep CNNs based on parameter binarization, and proposes a new block structure codenamed the pattern residual block that adds transformed feature maps generated by convolutional neural networks to the pattern feature maps generated by convolutions, based on which a small network with ~1 million parameters is designed.
Fixed Point Quantization of Deep Convolutional Networks
TLDR
This paper proposes a quantizer design for fixed-point implementation of DCNs, formulates and solves an optimization problem to identify the optimal fixed-point bit-width allocation across DCN layers, and demonstrates that fine-tuning can further enhance the accuracy of fixed-point DCNs beyond that of the original floating-point model.
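A minimal sketch of uniform fixed-point quantization at a given bit-width follows: an integer/fractional split is chosen from the tensor's range, then values are scaled, rounded, and clipped. The 8-bit setting and the split heuristic are illustrative assumptions; the per-layer bit-width optimization and fine-tuning the paper describes are not reproduced here.

```python
# Fixed-point quantization sketch (assumed 8-bit setting; per-layer optimization omitted).
import numpy as np

def fixed_point_quantize(w, total_bits=8):
    """Quantize to signed fixed point, returning the dequantized tensor and fractional length."""
    max_abs = np.abs(w).max()
    int_bits = max(int(np.ceil(np.log2(max_abs + 1e-12))) + 1, 1)  # sign + integer part
    frac_bits = total_bits - int_bits
    scale = 2.0 ** frac_bits
    qmin, qmax = -(2 ** (total_bits - 1)), 2 ** (total_bits - 1) - 1
    q = np.clip(np.round(w * scale), qmin, qmax)
    return q / scale, frac_bits

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    w = rng.normal(scale=0.05, size=(1000,)).astype(np.float32)
    wq, fl = fixed_point_quantize(w, total_bits=8)
    print(f"fractional bits: {fl}, max error: {np.abs(w - wq).max():.5f}")
```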
...
...

References

Showing 1–10 of 23 references
Learning both Weights and Connections for Efficient Neural Network
TLDR
A method that reduces the storage and computation required by neural networks by an order of magnitude without affecting their accuracy by learning only the important connections, pruning redundant connections with a three-step method.
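The three-step procedure (train, prune low-magnitude connections, retrain with the sparsity mask fixed) can be sketched on a toy model as below. The linear least-squares task, threshold, and learning rate are assumptions for illustration only; the cited paper applies this to full convolutional networks.

```python
# Train -> prune -> retrain sketch on a toy sparse regression problem (illustrative settings).
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 50))
true_w = rng.normal(size=50) * (rng.random(50) < 0.2)   # sparse ground truth
y = X @ true_w + 0.01 * rng.normal(size=200)

def train(w, mask, steps=500, lr=0.1):
    for _ in range(steps):
        grad = X.T @ (X @ (w * mask) - y) / len(y)
        w -= lr * grad * mask          # pruned connections receive no updates
    return w * mask

w = train(np.zeros(50), np.ones(50))                    # step 1: initial training
mask = (np.abs(w) > 0.05).astype(float)                 # step 2: prune small weights
w = train(w, mask)                                      # step 3: retrain survivors
print(f"remaining connections: {int(mask.sum())} / 50")
```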
Compressing Deep Convolutional Networks using Vector Quantization
TLDR
This paper is able to achieve 16-24 times compression of the network with only 1% loss of classification accuracy using a state-of-the-art CNN, and finds that, for compressing the most storage-demanding densely connected layers, vector quantization methods have a clear gain over existing matrix factorization methods.
Memory Bounded Deep Convolutional Networks
TLDR
This work investigates the use of sparsity-inducing regularizers during training of convolutional neural networks and shows that training with such regularization can still be performed using stochastic gradient descent, implying that it can be used easily in existing codebases.
Improving the speed of neural networks on CPUs
TLDR
This paper uses speech recognition as an example task, and shows that a real-time hybrid hidden Markov model / neural network (HMM/NN) large-vocabulary system can be built with a 10× speedup over an unoptimized baseline and a 4× speedup over an aggressively optimized floating-point baseline at no cost in accuracy.
Compressing Neural Networks with the Hashing Trick
TLDR
This work presents a novel network architecture, HashedNets, that exploits inherent redundancy in neural networks to achieve drastic reductions in model sizes, and demonstrates on several benchmark data sets that HashedNets shrink the storage requirements of neural networks substantially while mostly preserving generalization performance.
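A minimal sketch of this hashing-trick weight sharing follows: a virtual (out, in) weight matrix is never stored; instead each position (i, j) is hashed into a small vector of real parameters, so many connections share one trained value. The hash function and bucket count below are illustrative assumptions, not the paper's exact construction (which also hashes a sign factor).

```python
# HashedNets-style weight sharing sketch (assumed hash function and bucket count).
import numpy as np

class HashedLayer:
    def __init__(self, n_in, n_out, n_buckets, seed=0):
        rng = np.random.default_rng(seed)
        self.params = rng.normal(scale=0.1, size=n_buckets)   # the only stored parameters
        # Cheap stand-in for a universal hash of the index pair (i, j).
        i, j = np.meshgrid(np.arange(n_out), np.arange(n_in), indexing="ij")
        self.bucket = (i * 2654435761 + j * 40503) % n_buckets

    def weight_matrix(self):
        """Materialize the virtual weight matrix from the shared parameters."""
        return self.params[self.bucket]

    def forward(self, x):
        return x @ self.weight_matrix().T

if __name__ == "__main__":
    layer = HashedLayer(n_in=1024, n_out=256, n_buckets=4096)
    x = np.random.default_rng(1).normal(size=(8, 1024))
    print(layer.forward(x).shape, "stored params:", layer.params.size)
```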
ImageNet classification with deep convolutional neural networks
TLDR
A large, deep convolutional neural network was trained to classify the 1.2 million high-resolution images in the ImageNet LSVRC-2010 contest into the 1000 different classes and employed a recently developed regularization method called "dropout" that proved to be very effective.
Exploiting Linear Structure Within Convolutional Networks for Efficient Evaluation
TLDR
Using large state-of-the-art models, this work demonstrates speedups of convolutional layers on both CPU and GPU by a factor of 2 x, while keeping the accuracy within 1% of the original model.
Going deeper with convolutions
We propose a deep convolutional neural network architecture codenamed Inception that achieves the new state of the art for classification and detection in the ImageNet Large-Scale Visual Recognition Challenge 2014 (ILSVRC14).
Very Deep Convolutional Networks for Large-Scale Image Recognition
TLDR
This work investigates the effect of the convolutional network depth on its accuracy in the large-scale image recognition setting using an architecture with very small convolution filters, which shows that a significant improvement on the prior-art configurations can be achieved by pushing the depth to 16-19 weight layers.
Caffe: Convolutional Architecture for Fast Feature Embedding
TLDR
Caffe provides multimedia scientists and practitioners with a clean and modifiable framework for state-of-the-art deep learning algorithms and a collection of reference models for training and deploying general-purpose convolutional neural networks and other deep models efficiently on commodity architectures.
...
...