Deep compression and EIE: Efficient inference engine on compressed deep neural network

Song Han, Xingyu Liu, Huizi Mao, Jing Pu, A. Pedram, M. Horowitz and B. Dally. 2016 IEEE Hot Chips 28 Symposium (HCS).
This article consists only of a collection of slides from the authors' conference presentation.
Model Compression for Image Classification Based on Low-Rank Sparse Quantization
A low-rank sparse quantization method that quantizes weights and regularizes the structures of convolutional networks at the same time, reducing memory and computation cost and learning a compact structure from complex neural networks for subsequent channel pruning.
A Novel Low-Bit Quantization Strategy for Compressing Deep Neural Networks
A novel strategy for training low-bit networks, with weights and activations quantized to several bits, that addresses two corresponding fundamental issues and is shown to dramatically compress the neural network with only slight accuracy loss.
A High Energy-Efficiency Inference Accelerator Exploiting Sparse CNNs
A flexible CNN inference accelerator on FPGA that exploits the uniform sparsity introduced by pattern pruning to achieve high performance, together with a novel data-buffering structure with slightly rearranged sequences that addresses the challenge of access conflicts.
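Pattern pruning as described above constrains each 3x3 kernel to one of a few predefined nonzero patterns, which keeps the sparsity uniform and hardware-friendly. A minimal sketch, assuming an illustrative pattern set and helper name (not the paper's actual design):

```python
# Four predefined 4-entry patterns over a flattened 3x3 kernel.
# This pattern set is illustrative; real designs select patterns empirically.
PATTERNS = [{0, 1, 3, 4}, {1, 2, 4, 5}, {3, 4, 6, 7}, {4, 5, 7, 8}]

def pattern_prune(kernel):
    """Keep the pattern that preserves the most weight magnitude,
    zeroing all positions outside it."""
    pat = max(PATTERNS, key=lambda p: sum(abs(kernel[i]) for i in p))
    return [w if i in pat else 0.0 for i, w in enumerate(kernel)], pat

kernel = [0.0, 0.0, 0.0, 0.5, 0.6, 0.0, 0.7, 0.8, 0.0]
pruned, chosen = pattern_prune(kernel)  # chooses the lower-left pattern {3, 4, 6, 7}
```

Because every kernel keeps the same number of weights from a small pattern set, the accelerator can schedule its compute and buffering uniformly, which is what makes this sparsity "hardware-friendly".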
Modified Huffman based compression methodology for Deep Neural Network Implementation on Resource Constrained Mobile Platforms
This paper proposes a modified Huffman encoding-decoding technique, with dynamic usage of net layers, executed on the fly in parallel, which can be applied in a memory-constrained multicore environment; it is the first study to apply compression based on multiple bit-pattern sequences.
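The Huffman stage of such schemes assigns short codes to frequent quantized values. A minimal sketch of building a Huffman code over quantized weight indices, using plain textbook Huffman rather than the paper's modified variant:

```python
import heapq
from collections import Counter

def huffman_code(symbols):
    """Build a Huffman code for a sequence of symbols.

    Heap entries are (frequency, tiebreak_id, tree), where tree is
    either a leaf symbol or a (left, right) pair of subtrees.
    """
    freq = Counter(symbols)
    heap = [(f, i, s) for i, (s, f) in enumerate(freq.items())]
    heapq.heapify(heap)
    next_id = len(heap)
    while len(heap) > 1:
        f1, _, t1 = heapq.heappop(heap)
        f2, _, t2 = heapq.heappop(heap)
        heapq.heappush(heap, (f1 + f2, next_id, (t1, t2)))
        next_id += 1
    codes = {}
    def walk(tree, prefix):
        if isinstance(tree, tuple):
            walk(tree[0], prefix + "0")
            walk(tree[1], prefix + "1")
        else:
            codes[tree] = prefix or "0"  # single-symbol edge case
    walk(heap[0][2], "")
    return codes

# Quantized weight indices from a weight-shared layer: the skewed
# distribution (index 0 dominates) is what makes Huffman coding pay off.
indices = [0, 0, 0, 0, 0, 1, 1, 1, 2, 3]
codes = huffman_code(indices)
bits = "".join(codes[i] for i in indices)  # 17 bits vs 20 at fixed 2 bits/index
```

The most frequent index gets a 1-bit code, so the skewed index distribution left after pruning and quantization compresses below the fixed-length bit width.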
Data-Driven Compression of Convolutional Neural Networks
The paper demonstrates that a model compression algorithm combining reinforcement learning, architecture search, and knowledge distillation can automatically answer three key questions about compressing CNN models, and that the algorithm scales to run within a reasonable amount of time for many deployments.
CaffePresso: Accelerating Convolutional Networks on Embedded SoCs
In CaffePresso, auto-tuning of implementation parameters under platform-specific constraints delivers optimized solutions for each input ConvNet specification.
SPARCNet: A Hardware Accelerator for Efficient Deployment of Sparse Convolutional Networks
SPARCNet, a hardware accelerator for efficient deployment of SPARse Convolutional NETworks, enables deploying networks in embedded, resource-bound settings by exploiting both the efficient forms of parallelism inherent in convolutional layers and the proposed sparsification and approximation techniques.
Learning Sparse Convolutional Neural Network via Quantization With Low Rank Regularization
This paper proposes a low-rank sparse quantization (LRSQ) method that quantizes network weights and regularizes the corresponding structures at the same time, and shows that it can dramatically reduce the parameters and channels of a network with only slight loss of inference accuracy.
An Efficient End-to-End Deep Learning Training Framework via Fine-Grained Pattern-Based Pruning
This paper proposes ClickTrain, an efficient and accurate end-to-end training and pruning framework for CNNs that reduces the end-to-end time cost of state-of-the-art pruning-after-training methods while providing higher model accuracy and compression ratios via fine-grained, architecture-preserving pruning.
Optimally Scheduling CNN Convolutions for Efficient Memory Access
This work introduces an accelerator architecture, the Hardware Convolution Block (HWC), that implements the optimal schedules; it achieves up to a 14x reduction in memory bandwidth compared with a previously published accelerator that has a similar memory interface but implements a non-optimal schedule.


Deep Compression: Compressing Deep Neural Network with Pruning, Trained Quantization and Huffman Coding
This work introduces "deep compression", a three-stage pipeline of pruning, trained quantization, and Huffman coding that together reduce the storage requirement of neural networks by 35x to 49x without affecting their accuracy.
It is shown that an even-number filter size is much more hardware-friendly, ensuring high bandwidth and resource utilization, and that even-sized kernels can achieve even higher accuracy than odd-sized kernels.
DSD: Regularizing Deep Neural Networks with Dense-Sparse-Dense Training Flow
Experiments show that DSD training can improve the performance of a wide range of CNNs, RNNs, and LSTMs on image classification, caption generation, and speech recognition tasks.
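The dense-sparse-dense schedule alternates unconstrained training with training under a top-k magnitude mask, then drops the mask for a final dense phase. A sketch of just the masking step, with illustrative helper names and the actual training loops omitted:

```python
def sparsity_mask(weights, keep_ratio):
    """Sparse phase: keep only the largest-magnitude fraction of weights."""
    k = int(len(weights) * keep_ratio)
    ranked = sorted(range(len(weights)), key=lambda i: -abs(weights[i]))
    keep = set(ranked[:k])
    return [1.0 if i in keep else 0.0 for i in range(len(weights))]

def apply_mask(weights, mask):
    """Zero masked weights. In DSD this is applied after every update
    during the sparse phase; the mask is then dropped so the final
    dense phase can re-grow the pruned connections."""
    return [w * m for w, m in zip(weights, mask)]

weights = [0.9, -0.05, 0.4, 0.1]       # weights after the first dense phase
mask = sparsity_mask(weights, 0.5)     # keep top 50% by magnitude
sparse = apply_mask(weights, mask)     # train under this constraint, then relax it
```

The regularizing effect comes from this constrain-then-relax cycle rather than from the sparsity itself, which is why the final model is dense again.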
"EIE: Efficient Inference Engine on Compressed Deep Neural Network", 2016