Publications
Deep Compression: Compressing Deep Neural Networks with Pruning, Trained Quantization and Huffman Coding
TLDR
We introduce "deep compression", a three-stage pipeline: pruning, trained quantization, and Huffman coding, which together reduce the storage requirement of neural networks by 35x to 49x without affecting their accuracy.
  • Citations: 3,853 • Highly influential: 453
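To make the pipeline concrete, here is a minimal numpy sketch of the first two stages on a random weight matrix. The pruning threshold, cluster count, and the omitted fine-tuning of centroids during retraining are illustrative simplifications, not the paper's tuned procedure; the third stage would entropy-code the resulting cluster indices.

```python
import numpy as np

def prune(weights, threshold=0.05):
    """Stage 1: zero out connections whose magnitude is below a threshold."""
    mask = np.abs(weights) > threshold
    return weights * mask, mask

def kmeans_quantize(weights, mask, n_clusters=16, n_iters=20):
    """Stage 2: weight sharing. Surviving weights are clustered with k-means
    and replaced by their centroid, so only a small index per weight plus a
    codebook must be stored."""
    vals = weights[mask]
    # Linear initialization of centroids over the value range.
    centroids = np.linspace(vals.min(), vals.max(), n_clusters)
    for _ in range(n_iters):
        assign = np.argmin(np.abs(vals[:, None] - centroids[None, :]), axis=1)
        for k in range(n_clusters):
            if np.any(assign == k):
                centroids[k] = vals[assign == k].mean()
    quantized = weights.copy()
    quantized[mask] = centroids[assign]
    return quantized, centroids, assign

rng = np.random.default_rng(0)
w = rng.normal(0.0, 0.1, size=(64, 64))
w_pruned, mask = prune(w)
w_q, codebook, codes = kmeans_quantize(w_pruned, mask)
print(f"sparsity: {1 - mask.mean():.2%}, codebook entries: {codebook.size}")
# Stage 3 (Huffman coding) would entropy-code the indices in `codes`.
```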
SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <1MB model size
TLDR
We present SqueezeNet, a CNN architecture that has 50× fewer parameters than AlexNet and maintains AlexNet-level accuracy on ImageNet.
  • Citations: 2,298 • Highly influential: 374
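The parameter savings come from SqueezeNet's Fire module: a 1x1 "squeeze" layer first shrinks the channel count, then a mixed 1x1/3x3 "expand" layer restores it. A short parameter-count sketch, using the fire2 layer sizes reported in the paper and ignoring biases, shows the effect against a plain 3x3 convolution:

```python
def fire_params(c_in, s1x1, e1x1, e3x3):
    """Weights in a Fire module: a 1x1 squeeze layer with s1x1 filters,
    then an expand layer mixing e1x1 1x1 filters and e3x3 3x3 filters."""
    squeeze = c_in * s1x1
    expand = s1x1 * e1x1 + s1x1 * e3x3 * 3 * 3
    return squeeze + expand

c_in, c_out = 96, 128
plain = c_in * c_out * 3 * 3                         # ordinary 3x3 conv layer
fire = fire_params(c_in, s1x1=16, e1x1=64, e3x3=64)  # fire2's sizes
print(f"plain 3x3: {plain:,}  fire module: {fire:,}  ratio: {plain / fire:.1f}x")
```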
Learning both Weights and Connections for Efficient Neural Network
TLDR
We present a method that learns only the important connections, reducing the storage and computation required by neural networks by an order of magnitude without affecting their accuracy.
  • Citations: 2,609 • Highly influential: 360
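The method alternates training with magnitude-based pruning, then retrains the surviving connections with the pruned ones held at zero. A toy numpy sketch, with linear regression standing in for a network layer and an illustrative 50% per-round pruning quantile, shows the prune-retrain loop:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy problem: linear regression with a mostly-zero ground-truth weight vector.
X = rng.normal(size=(256, 32))
true_w = rng.normal(size=32) * (rng.random(32) < 0.3)
y = X @ true_w

w = rng.normal(size=32) * 0.1
mask = np.ones(32, dtype=bool)

for round_ in range(3):
    # Retrain: gradient descent updates only the unpruned connections.
    for _ in range(500):
        grad = X.T @ (X @ w - y) / len(X)
        w -= 0.1 * grad * mask
    # Prune: drop the smaller half (by magnitude) of the surviving weights.
    threshold = np.quantile(np.abs(w[mask]), 0.5)
    mask &= np.abs(w) >= threshold
    w *= mask
    print(f"round {round_}: {mask.sum()} of 32 connections kept")
```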
ProxylessNAS: Direct Neural Architecture Search on Target Task and Hardware
TLDR
We address the high memory consumption of differentiable NAS and reduce its computational cost (GPU hours and GPU memory) to the level of regular training while still allowing a large candidate set.
  • Citations: 546 • Highly influential: 163
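The memory saving comes from binarizing the path choice: rather than evaluating every candidate operation of the over-parameterized network and keeping all of their activations, only one sampled path is active per step. Below is a toy sketch with made-up candidate ops; the real method also learns the architecture parameters through the binarized gates, which is omitted here:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical candidate operations on one edge of the over-parameterized net.
ops = [
    ("identity", lambda x: x),
    ("scale2x", lambda x: 2.0 * x),
    ("negate", lambda x: -x),
]
alpha = np.zeros(len(ops))  # architecture parameters, one per candidate

def forward_one_path(x, alpha):
    """Binarized forward pass: sample and run a single candidate path.
    A differentiable-NAS supernet would run all candidates weighted by
    softmax(alpha) and keep every activation in memory; sampling one path
    keeps memory at the level of a single compact network."""
    probs = np.exp(alpha - alpha.max())
    probs /= probs.sum()
    idx = rng.choice(len(ops), p=probs)
    name, op = ops[idx]
    return op(x), name

out, chosen = forward_one_path(np.ones(4), alpha)
print(chosen, out)
```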
EIE: Efficient Inference Engine on Compressed Deep Neural Network
TLDR
We propose an energy-efficient inference engine (EIE) that performs inference on a compressed network model and accelerates the resulting sparse matrix-vector multiplication with weight sharing.
  • Citations: 1,319 • Highly influential: 148
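The kernel EIE accelerates looks roughly like this numpy sketch: a sparse matrix stored column-wise whose entries are small indices into a shared codebook, with columns skipped whenever the input activation is zero. The storage format here is simplified relative to the paper's hardware encoding:

```python
import numpy as np

def sparse_shared_matvec(n_rows, indptr, row_idx, codes, codebook, x):
    """y = W @ x with W in compressed-sparse-column (CSC) form.
    Each stored entry is a small index into a shared codebook rather than
    a full-precision weight, and any column whose input activation is zero
    is skipped entirely."""
    y = np.zeros(n_rows)
    for j, xj in enumerate(x):
        if xj == 0.0:  # dynamic activation sparsity: skip the whole column
            continue
        for p in range(indptr[j], indptr[j + 1]):
            y[row_idx[p]] += codebook[codes[p]] * xj
    return y

# Tiny example: a 3x4 matrix with 4 nonzeros drawn from a 2-entry codebook.
codebook = np.array([0.5, -1.0])
indptr = np.array([0, 1, 1, 3, 4])  # per-column start offsets into the arrays
row_idx = np.array([0, 1, 2, 0])    # row of each nonzero
codes = np.array([0, 1, 0, 1])      # codebook index of each nonzero
x = np.array([1.0, 5.0, 0.0, 2.0])  # x[2] == 0, so column 2 is never touched
print(sparse_shared_matvec(3, indptr, row_idx, codes, codebook, x))
```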
SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5MB model size
Recent research on deep convolutional neural networks (CNNs) has focused primarily on improving accuracy. For a given accuracy level, it is typically possible to identify multiple CNN architectures that achieve that accuracy level.
  • Citations: 648 • Highly influential: 113
AMC: AutoML for Model Compression and Acceleration on Mobile Devices
TLDR
We propose AutoML for Model Compression (AMC), which leverages reinforcement learning to efficiently sample the design space and improve the model compression quality.
  • Citations: 450 • Highly influential: 89
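As a rough illustration of the search problem AMC automates, the sketch below replaces the paper's reinforcement-learning (DDPG) agent with plain random search over per-layer keep ratios under a parameter budget; the layer sizes and the proxy_accuracy reward are hypothetical stand-ins for evaluating the pruned model:

```python
import numpy as np

rng = np.random.default_rng(0)

layer_params = np.array([3_000, 50_000, 400_000, 10_000])  # toy layer sizes
budget = 0.3 * layer_params.sum()  # keep at most 30% of all parameters

def proxy_accuracy(keep_ratios):
    """Hypothetical reward: pruning a layer hurts more the harder it is
    pruned, weighted by layer size. A real agent would evaluate the
    actually-pruned model on validation data."""
    return float(np.sum(layer_params * np.sqrt(keep_ratios)) / layer_params.sum())

best = None
for _ in range(200):  # random search stands in for AMC's learned agent
    ratios = rng.uniform(0.05, 1.0, size=layer_params.size)
    if np.sum(layer_params * ratios) > budget:
        continue  # violates the compression constraint
    score = proxy_accuracy(ratios)
    if best is None or score > best[0]:
        best = (score, ratios)

print("best per-layer keep ratios:", np.round(best[1], 2))
```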
WirelessHART: Applying Wireless Technology in Real-Time Industrial Process Control
TLDR
In this paper, we give an introduction to the architecture of WirelessHART and share our first-hand experience in building a prototype for this specification.
  • Citations: 584 • Highly influential: 70
Trained Ternary Quantization
TLDR
We propose Trained Ternary Quantization (TTQ), a method that can reduce the precision of weights in neural networks to ternary values.
  • Citations: 586 • Highly influential: 70
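A minimal sketch of the ternary forward quantization: weights map to {-w_n, 0, +w_p} with a threshold set to a fraction of the largest magnitude. Here the two scales are set heuristically from the partition means, whereas TTQ learns them by backpropagation during training:

```python
import numpy as np

def ternarize(w, t=0.05):
    """Map full-precision weights to the ternary values {-w_n, 0, +w_p}.
    The threshold is the fraction t of the largest weight magnitude; the
    per-sign scales below are heuristic initializations, not the learned
    scales of the full method."""
    delta = t * np.abs(w).max()
    pos, neg = w > delta, w < -delta
    w_p = w[pos].mean() if pos.any() else 0.0
    w_n = -w[neg].mean() if neg.any() else 0.0
    return w_p * pos - w_n * neg, (w_p, w_n)

rng = np.random.default_rng(0)
w = rng.normal(0.0, 0.1, size=(4, 4))
w_t, scales = ternarize(w)
print(np.round(w_t, 3))
print("scales (w_p, w_n):", scales)
```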
ESE: Efficient Speech Recognition Engine with Sparse LSTM on FPGA
TLDR
We propose a load-balance-aware pruning method that can compress the LSTM model size by 20x (10x from pruning and 2x from quantization) with negligible loss of prediction accuracy.
  • Citations: 334 • Highly influential: 63
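A sketch of the load-balance-aware idea, assuming banks are contiguous row groups (the paper interleaves rows across processing elements): every bank is pruned to the same nonzero count, so no hardware lane ends up with more work than the others.

```python
import numpy as np

def balanced_prune(w, n_banks=4, sparsity=0.9):
    """Prune each bank of rows to an identical sparsity. Plain magnitude
    pruning can leave some processing elements with far more nonzeros than
    others; an equal per-bank nonzero count keeps the pipeline balanced."""
    out = np.zeros_like(w)
    for bank in np.array_split(np.arange(w.shape[0]), n_banks):
        block = w[bank]
        k = int(round(block.size * (1 - sparsity)))  # nonzeros to keep
        thresh = np.sort(np.abs(block).ravel())[-k] if k > 0 else np.inf
        out[bank] = block * (np.abs(block) >= thresh)
    return out

rng = np.random.default_rng(0)
wp = balanced_prune(rng.normal(size=(8, 16)))
for i, bank in enumerate(np.array_split(wp, 4)):
    print(f"bank {i}: {np.count_nonzero(bank)} nonzeros")
```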