• Corpus ID: 237353343

Compact representations of convolutional neural networks via weight pruning and quantization

Giosuè Cataldo Marinò, Alessandro Petrini, Dario Malchiodi, Marco Frasca
The state-of-the-art performance for several real-world problems is currently reached by convolutional neural networks (CNNs). Such learning models exploit recent results in the field of deep learning, typically leading to highly performing, yet very large neural networks with (at least) millions of parameters. As a result, the deployment of such models is not possible when only small amounts of RAM are available, or in general within resource-limited platforms, and strategies to compress CNNs…

Reproducing the Sparse Huffman Address Map Compression for Deep Neural Networks

The implementation described in this paper offers different compression schemes (pruning, two types of weight quantization, and their combinations) and two compact representations: Huffman Address Map compression (HAM) and its sparse version, sHAM.
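
The combination of pruning and weight quantization that such compact representations build on can be sketched as follows. This is a generic illustration, not the authors' HAM/sHAM implementation; the function name, the quantile-based codebook, and all parameters are illustrative.

```python
import numpy as np

def prune_and_quantize(weights, prune_frac=0.5, n_levels=4):
    """Illustrative sketch: magnitude-prune a weight matrix, then map
    the surviving weights onto a small codebook. A compact
    representation then needs to store only (mask, idx, centroids)."""
    w = weights.copy()
    # 1) pruning: zero out the smallest-magnitude fraction of weights
    thresh = np.quantile(np.abs(w), prune_frac)
    mask = np.abs(w) > thresh
    w[~mask] = 0.0
    # 2) quantization: a crude quantile-based codebook over survivors
    #    (real schemes typically use k-means or trained centroids)
    nz = w[mask]
    centroids = np.quantile(nz, np.linspace(0, 1, n_levels))
    idx = np.abs(nz[:, None] - centroids[None, :]).argmin(axis=1)
    w[mask] = centroids[idx]
    return w, mask, idx, centroids
```

After this step every nonzero weight takes one of `n_levels` values, so the index array `idx` needs only `log2(n_levels)` bits per surviving weight.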

A Survey of Model Compression and Acceleration for Deep Neural Networks

This paper surveys recent advanced techniques for compacting and accelerating CNN models, roughly categorized into four schemes: parameter pruning and sharing, low-rank factorization, transferred/compact convolutional filters, and knowledge distillation.

Deep Compression: Compressing Deep Neural Network with Pruning, Trained Quantization and Huffman Coding

This work introduces "deep compression", a three-stage pipeline of pruning, trained quantization, and Huffman coding, whose stages work together to reduce the storage requirement of neural networks by 35x to 49x without affecting their accuracy.
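
The final Huffman-coding stage of such a pipeline exploits the skewed distribution of quantized weight indices. A minimal sketch of computing Huffman code lengths for a stream of quantized indices (the function name and heap layout are illustrative, not the paper's code):

```python
import heapq
from collections import Counter

def huffman_code_lengths(symbols):
    """Return the Huffman code length (in bits) for each distinct
    symbol in `symbols`. Frequent symbols get shorter codes, which is
    what makes the coding stage pay off on skewed index distributions."""
    freq = Counter(symbols)
    if len(freq) == 1:
        return {next(iter(freq)): 1}
    # heap entries: (frequency, unique tiebreak, {symbol: depth})
    heap = [(f, i, {s: 0}) for i, (s, f) in enumerate(freq.items())]
    heapq.heapify(heap)
    tiebreak = len(heap)
    while len(heap) > 1:
        f1, _, d1 = heapq.heappop(heap)
        f2, _, d2 = heapq.heappop(heap)
        merged = {s: d + 1 for s, d in {**d1, **d2}.items()}
        heapq.heappush(heap, (f1 + f2, tiebreak, merged))
        tiebreak += 1
    return heap[0][2]
```

For example, indices with frequencies 8, 4, 2, 2 get code lengths 1, 2, 3, 3, costing 28 bits instead of the 32 bits a fixed 2-bit code would need.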

Incremental Network Quantization: Towards Lossless CNNs with Low-Precision Weights

Extensive experiments on the ImageNet classification task using almost all known deep CNN architectures, including AlexNet, VGG-16, GoogleNet, and ResNets, testify to the efficacy of the proposed INQ, showing that at 5-bit quantization the models achieve higher accuracy than their 32-bit floating-point references.
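
The core quantizer behind INQ constrains each weight to zero or a signed power of two, so multiplications can become bit shifts. A sketch of that mapping, assuming the power-of-two level set described in the paper (the function and parameter names are illustrative, and the incremental, group-wise retraining that INQ adds is omitted):

```python
import numpy as np

def quantize_pow2(w, n_bits=5):
    """Map each weight to 0 or ±2^k, with exponents chosen from the
    largest weight magnitude down. Retraining between quantization
    steps (the 'incremental' part of INQ) is not modeled here."""
    w = np.asarray(w, dtype=np.float64)
    s = np.max(np.abs(w))
    if s == 0:
        return w.copy()
    # top exponent covers the largest weight; one level is reserved for 0
    n1 = int(np.floor(np.log2(4 * s / 3)))
    exps = n1 - np.arange(2 ** (n_bits - 2))
    levels = np.concatenate(([0.0], 2.0 ** exps))
    # snap each |w| to the nearest level, keeping the sign
    idx = np.abs(np.abs(w)[..., None] - levels).argmin(axis=-1)
    return np.sign(w) * levels[idx]
```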

Value-aware Quantization for Training and Inference of Neural Networks

We propose a novel value-aware quantization that applies aggressively reduced precision to the majority of data while separately handling a small amount of large-magnitude data in high precision.
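
The split described above can be sketched as follows: keep the few largest-magnitude values untouched and uniformly quantize everything else. This is a generic illustration of the idea; the function name, the fraction parameter, and the uniform quantizer are assumptions, not the paper's exact scheme.

```python
import numpy as np

def value_aware_quantize(x, big_frac=0.25, n_bits=3):
    """Keep the `big_frac` largest-magnitude values at full precision;
    uniformly quantize the remaining majority to 2**n_bits levels."""
    x = np.asarray(x, dtype=np.float64)
    flat = np.abs(x).ravel()
    k = max(1, int(np.ceil(big_frac * flat.size)))
    cutoff = np.partition(flat, -k)[-k]   # magnitude of k-th largest value
    big = np.abs(x) >= cutoff
    out = x.copy()
    small = x[~big]
    if small.size:
        step = cutoff / (2 ** n_bits)     # step size over the small range
        out[~big] = np.round(small / step) * step
    return out, big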

Speeding up Convolutional Neural Networks with Low Rank Expansions

Two simple schemes for drastically speeding up convolutional neural networks are presented, both achieved by exploiting cross-channel or filter redundancy to construct a low-rank basis of filters that are rank-1 in the spatial domain.
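
The underlying mechanism, replacing one large linear map by a product of two thin ones, can be sketched with a truncated SVD. This is a generic low-rank factorization, not the paper's specific cross-channel filter construction:

```python
import numpy as np

def low_rank_factorize(W, rank):
    """Approximate W (m x n) by A @ B with A (m x r) and B (r x n) via
    truncated SVD. When r << min(m, n), storing and applying A and B
    is cheaper than W itself."""
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    A = U[:, :rank] * s[:rank]   # absorb singular values into A
    B = Vt[:rank, :]
    return A, B
```

If the weight matrix is close to low-rank, as filter banks with cross-channel redundancy often are, the approximation error is small while the parameter count drops from `m*n` to `r*(m+n)`.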

On Compressing Deep Models by Low Rank and Sparse Decomposition

A unified framework is proposed that integrates the low-rank and sparse decomposition of weight matrices with feature-map reconstruction, significantly reducing the parameters of both convolutional and fully-connected layers.

Pruning Filters for Efficient ConvNets

This work presents an acceleration method for CNNs, showing that even simple filter-pruning techniques can reduce inference costs for VGG-16 and ResNet-110 on CIFAR-10 by up to 38% while regaining close to the original accuracy by retraining the networks.
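
The "simple" criterion in question ranks each output filter by the L1 norm of its weights and drops the smallest ones, which removes whole channels and shrinks the next layer too. A minimal sketch (function and parameter names are illustrative):

```python
import numpy as np

def prune_filters_l1(conv_w, keep):
    """Rank the output filters of a conv layer with weights of shape
    [out_ch, in_ch, k, k] by L1 norm and keep the `keep` largest,
    preserving their original order."""
    norms = np.abs(conv_w).sum(axis=(1, 2, 3))     # one score per filter
    keep_idx = np.argsort(norms)[::-1][:keep]
    keep_idx.sort()
    return conv_w[keep_idx], keep_idx
```

In a full pipeline, the corresponding input channels of the following layer are removed as well, and a short retraining pass recovers most of the lost accuracy.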

Universal Deep Neural Network Compression

This work introduces, for the first time, universal DNN compression via universal vector quantization and universal source coding. It relies on universal lattice quantization, which randomizes the source by uniform random dithering before lattice quantization and can perform near-optimally on any source without relying on knowledge of the source distribution.
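
The dithering idea can be illustrated in one dimension with subtractive-dither uniform quantization: add uniform noise before rounding to the lattice and subtract it afterwards, which makes the quantization error independent of the source. This is a textbook sketch of the principle, not the paper's lattice construction:

```python
import numpy as np

def dithered_quantize(x, step, rng):
    """Subtractive-dither scalar quantization: y = Q(x + u) - u with
    u ~ Uniform[-step/2, step/2). The error y - x is then uniformly
    distributed and independent of x, whatever the source."""
    u = rng.uniform(-step / 2, step / 2, size=np.shape(x))
    q = np.round((x + u) / step) * step
    return q - u
```

The error `y - x` equals the rounding error of `x + u`, so it is always bounded by half the step size.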

Importance Estimation for Neural Network Pruning

A novel method is described that estimates the contribution of a neuron (filter) to the final loss and iteratively removes those with smaller scores, along with two variations that use first- and second-order Taylor expansions to approximate a filter's contribution.
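
A first-order score of this kind can be sketched from quantities already available during backpropagation: the loss change from zeroing a filter is approximated by the gradient-weight product summed over that filter's parameters. This is a common form of the criterion, written here with illustrative names rather than the paper's exact definition:

```python
import numpy as np

def taylor_importance(weights, grads):
    """First-order Taylor importance per filter: |sum(grad * weight)|
    over each filter's parameters. `weights` and `grads` share shape
    [n_filters, ...]; filters with low scores are pruning candidates."""
    contrib = (weights * grads).reshape(weights.shape[0], -1)
    return np.abs(contrib.sum(axis=1))
```

A filter whose gradient-weight products cancel out (or whose gradients are zero) scores near zero, marking it as cheap to remove.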