Corpus ID: 196470957

And the Bit Goes Down: Revisiting the Quantization of Neural Networks

@article{Stock2020AndTB,
  title={And the Bit Goes Down: Revisiting the Quantization of Neural Networks},
  author={Pierre Stock and Armand Joulin and R. Gribonval and Benjamin Graham and H. J{\'e}gou},
  journal={ArXiv},
  year={2020},
  volume={abs/1907.05686}
}
In this paper, we address the problem of reducing the memory footprint of convolutional network architectures. [...] Key Method: Our method only requires a set of unlabelled data at quantization time and allows for efficient inference on CPU by using byte-aligned codebooks to store the compressed weights. We validate our approach by quantizing a high-performing ResNet-50 model to a memory size of 5MB (20x compression factor) while preserving a top-1 accuracy of 76.1% on ImageNet object classification and by…
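To make the byte-aligned codebook idea concrete, below is a minimal product-quantization sketch in NumPy: rows of a weight matrix are split into small sub-vectors, clustered into 256 centroids so that each code fits in one byte, and decoded by table lookup. The block size, the centroid count, and the use of plain k-means on the weights (rather than the paper's activation-aware reconstruction objective) are illustrative assumptions, not the authors' exact procedure.

```python
import numpy as np

def pq_quantize(W, d=4, k=256, iters=15, seed=0):
    """Quantize a weight matrix into a shared codebook plus one-byte codes."""
    rng = np.random.default_rng(seed)
    out_f, in_f = W.shape
    assert in_f % d == 0, "in_features must be divisible by the block size d"
    blocks = W.reshape(-1, d)                          # contiguous sub-vectors of size d
    centroids = blocks[rng.choice(len(blocks), size=k, replace=False)].copy()
    for _ in range(iters):                             # plain Lloyd / k-means iterations
        dists = ((blocks ** 2).sum(1, keepdims=True)
                 - 2.0 * blocks @ centroids.T
                 + (centroids ** 2).sum(1))
        codes = dists.argmin(axis=1)
        for c in range(k):
            members = blocks[codes == c]
            if len(members):
                centroids[c] = members.mean(axis=0)
    return centroids.astype(np.float32), codes.astype(np.uint8)

W = np.random.randn(256, 512).astype(np.float32)
centroids, codes = pq_quantize(W)
W_hat = centroids[codes].reshape(W.shape)              # decoding is one lookup per block
ratio = W.nbytes / (codes.nbytes + centroids.nbytes)
print(f"compression ~{ratio:.1f}x, reconstruction MSE {np.mean((W - W_hat) ** 2):.4f}")
```

Because every code occupies exactly one byte and decoding is a single table lookup per block, this storage layout maps naturally onto CPU inference, which is the motivation for byte-aligned codebooks in the abstract above.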
Towards Convolutional Neural Networks Compression via Global&Progressive Product Quantization
TLDR: G&P PQ, an end-to-end product-quantization-based network compression method, is introduced to merge the separate quantization and fine-tuning processes into a consistent training framework and to make the network capable of learning complex dependencies among layers by quantizing globally and progressively.
EXTREME MODEL COMPRESSION
We tackle the problem of producing compact models, maximizing their accuracy for a given model size. A standard solution is to train networks with Quantization Aware Training (Jacob et al., 2018), …
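For context, here is a minimal sketch of the fake-quantization trick that underlies Quantization Aware Training: weights are rounded to a low-bit grid in the forward pass while the straight-through estimator lets gradients flow as if no rounding had happened. The symmetric per-tensor int8 scale is an illustrative choice, not the exact scheme of Jacob et al. (2018).

```python
import torch

def fake_quantize(w: torch.Tensor, num_bits: int = 8) -> torch.Tensor:
    qmax = 2 ** (num_bits - 1) - 1                       # e.g. 127 for int8
    scale = w.detach().abs().max().clamp(min=1e-8) / qmax
    w_q = torch.clamp(torch.round(w / scale), -qmax, qmax) * scale
    # Straight-through estimator: forward uses w_q, backward sees the identity.
    return w + (w_q - w).detach()

w = torch.randn(64, 128, requires_grad=True)
loss = fake_quantize(w).pow(2).sum()
loss.backward()                                          # gradients reach the full-precision weights
print(w.grad.abs().mean())
```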
Training with Quantization Noise for Extreme Fixed-Point Compression
TLDR: This paper proposes to quantize only a different random subset of weights during each forward pass, allowing unbiased gradients to flow through the other weights, and establishes new state-of-the-art compromises between accuracy and model size in both natural language processing and image classification.
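A minimal sketch of that random-subset idea follows, assuming a per-weight Bernoulli mask and int8 rounding; the paper applies the noise block-wise to match the target quantization method (e.g. product quantization), so this is a simplification.

```python
import torch

def quant_noise(w: torch.Tensor, p: float = 0.5, num_bits: int = 8) -> torch.Tensor:
    qmax = 2 ** (num_bits - 1) - 1
    scale = w.detach().abs().max().clamp(min=1e-8) / qmax
    w_q = torch.clamp(torch.round(w / scale), -qmax, qmax) * scale
    mask = (torch.rand_like(w) < p).float()              # 1 = quantize this weight on this pass
    noisy = mask * w_q + (1 - mask) * w                  # only a random subset sees quantization noise
    return w + (noisy - w).detach()                      # straight-through on the quantized subset

w = torch.randn(64, 128, requires_grad=True)
quant_noise(w, p=0.5).sum().backward()
print(w.grad.mean())
```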
Differentiable Model Compression via Pseudo Quantization Noise
We propose to add independent pseudo quantization noise to model parameters during training to approximate the effect of a quantization operator. This method, DIFFQ, is differentiable both with …
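A minimal sketch of additive pseudo quantization noise in that spirit: uniform noise with the width of the quantization step replaces the rounding operator during training, which keeps everything differentiable. The fixed bit-width below is an assumption; DIFFQ actually learns the bit-width per group of weights.

```python
import torch

def pseudo_quant_noise(w: torch.Tensor, num_bits: int = 8) -> torch.Tensor:
    span = w.detach().abs().max() * 2                    # dynamic range of the weights
    delta = span / (2 ** num_bits - 1)                   # quantization step size
    noise = (torch.rand_like(w) - 0.5) * delta           # U(-delta/2, +delta/2), no rounding
    return w + noise                                     # differentiable w.r.t. w

w = torch.randn(64, 128, requires_grad=True)
pseudo_quant_noise(w).pow(2).sum().backward()
print(w.grad.shape)
```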
A White Paper on Neural Network Quantization
TLDR: This paper introduces state-of-the-art algorithms for mitigating the impact of quantization noise on the network’s performance while maintaining low-bit weights and activations, and considers two main classes of algorithms: Post-Training Quantization and Quantization-Aware Training.
Fixed-point Quantization of Convolutional Neural Networks for Quantized Inference on Embedded Platforms
TLDR: This paper proposes a method to optimally quantize the weights, biases and activations of each layer of a pre-trained CNN while controlling the loss in inference accuracy, yielding a low-precision CNN for quantized inference with accuracy losses of less than 1%.
Accelerating Neural Network Inference by Overflow Aware Quantization
TLDR: Experimental results demonstrate that the proposed overflow-aware quantization method can achieve performance comparable with state-of-the-art quantization methods while accelerating the inference process by about 2 times.
Transform Quantization for CNN Compression
TLDR: This paper optimally transforms and quantizes the weights post-training using a rate-distortion framework to improve compression at any given quantization bit-rate, and finds that transform quantization with retraining is able to compress CNN models such as AlexNet, ResNet and DenseNet to very low bit-rates.
Spatial Shift Point-Wise Quantization
TLDR: The proposed spatial-shift pointwise quantization (SSPQ) model elegantly combines compact network-design techniques to revitalize DNN quantization efficiency with little accuracy loss.
Exploring Neural Networks Quantization via Layer-Wise Quantization Analysis
TLDR: A simple analytic framework is presented that breaks down overall degradation into its per-layer contributions, allowing a more nuanced examination of how quantization affects the network and enabling the design of better-performing schemes.
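A minimal sketch of such a per-layer breakdown: quantize one layer at a time, re-evaluate, and attribute the metric drop to that layer. The `model` and `evaluate` callables (the latter returning e.g. top-1 accuracy) are placeholders you would supply, and the 4-bit fake quantization is an illustrative choice.

```python
import copy
import torch

def fake_quantize_(t: torch.Tensor, num_bits: int = 4) -> None:
    qmax = 2 ** (num_bits - 1) - 1
    scale = t.abs().max().clamp(min=1e-8) / qmax
    t.copy_(torch.clamp(torch.round(t / scale), -qmax, qmax) * scale)

def layerwise_sensitivity(model: torch.nn.Module, evaluate, num_bits: int = 4):
    baseline = evaluate(model)
    report = {}
    for name, module in model.named_modules():
        if not isinstance(module, (torch.nn.Conv2d, torch.nn.Linear)):
            continue
        probe = copy.deepcopy(model)                     # quantize only this one layer
        with torch.no_grad():
            fake_quantize_(dict(probe.named_modules())[name].weight, num_bits)
        report[name] = baseline - evaluate(probe)        # per-layer contribution to degradation
    return report
```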

References

Showing 1-10 of 60 references
Deep Compression: Compressing Deep Neural Network with Pruning, Trained Quantization and Huffman Coding
TLDR: This work introduces "deep compression", a three-stage pipeline of pruning, trained quantization and Huffman coding that together reduce the storage requirement of neural networks by 35x to 49x without affecting their accuracy.
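A minimal sketch of the first two stages of that pipeline (magnitude pruning followed by k-means weight sharing); the Huffman stage is only estimated here through the entropy of the cluster indices, and the sparsity ratio and cluster count are illustrative choices rather than the paper's settings.

```python
import numpy as np

def prune_and_share(w, sparsity=0.9, n_clusters=16, iters=15, seed=0):
    rng = np.random.default_rng(seed)
    flat = w.flatten()
    threshold = np.quantile(np.abs(flat), sparsity)      # magnitude pruning
    kept = flat[np.abs(flat) > threshold]
    # 1-D k-means: surviving weights share n_clusters centroid values.
    centers = rng.choice(kept, size=n_clusters, replace=False).astype(np.float64)
    for _ in range(iters):
        codes = np.abs(kept[:, None] - centers[None, :]).argmin(axis=1)
        for c in range(n_clusters):
            if np.any(codes == c):
                centers[c] = kept[codes == c].mean()
    # Entropy of the codes lower-bounds the Huffman-coded bits per surviving weight.
    p = np.bincount(codes, minlength=n_clusters) / len(codes)
    entropy_bits = -(p[p > 0] * np.log2(p[p > 0])).sum()
    return kept, centers, codes, entropy_bits

w = np.random.randn(256, 512).astype(np.float32)
kept, centers, codes, bits = prune_and_share(w)
print(f"{len(kept)} weights survive pruning, ~{bits:.2f} Huffman bits per index")
```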
Incremental Network Quantization: Towards Lossless CNNs with Low-Precision Weights
TLDR: Extensive experiments on the ImageNet classification task using almost all known deep CNN architectures, including AlexNet, VGG-16, GoogLeNet and ResNets, testify to the efficacy of the proposed INQ, showing that at 5-bit quantization the models achieve higher accuracy than their 32-bit floating-point references.
Compressing Deep Convolutional Networks using Vector Quantization
TLDR: This paper achieves 16-24 times compression of a state-of-the-art CNN with only 1% loss of classification accuracy, and finds that for compressing the most storage-demanding densely connected layers, vector quantization methods have a clear gain over existing matrix factorization methods.
Learning Efficient Convolutional Networks through Network Slimming
TLDR: The approach, called network slimming, takes wide and large networks as input models; during training, insignificant channels are automatically identified and subsequently pruned, yielding thin and compact models with comparable accuracy.
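A minimal sketch of the channel-selection step in that approach, assuming the BatchNorm scale |gamma| as the per-channel importance signal and an illustrative global pruning ratio; actually rebuilding the physically thinner network from the surviving channels is omitted.

```python
import torch

def select_channels(model: torch.nn.Module, prune_ratio: float = 0.7):
    # Collect all BatchNorm scaling factors and pick a single global threshold.
    gammas = torch.cat([m.weight.detach().abs().flatten()
                        for m in model.modules()
                        if isinstance(m, torch.nn.BatchNorm2d)])
    threshold = torch.quantile(gammas, prune_ratio)
    keep = {}
    for name, m in model.named_modules():
        if isinstance(m, torch.nn.BatchNorm2d):
            keep[name] = (m.weight.detach().abs() > threshold).nonzero().flatten()
    return keep                                          # surviving channel indices per BN layer
```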
ThiNet: A Filter Level Pruning Method for Deep Neural Network Compression
TLDR: ThiNet is proposed, an efficient and unified framework to simultaneously accelerate and compress CNN models in both the training and inference stages; it is revealed that filters should be pruned based on statistics computed from the next layer rather than the current layer, which differentiates ThiNet from existing methods.
Towards the Limit of Network Quantization
TLDR: It is derived that the network quantization problem can be related to the entropy-constrained scalar quantization (ECSQ) problem in information theory, and two solutions of ECSQ are proposed: uniform quantization and an iterative solution similar to Lloyd's algorithm.
Model compression as constrained optimization, with application to neural nets. Part II: quantization
TLDR: This work describes a new approach based on the recently proposed framework of model compression as constrained optimization, which results in a simple iterative "learning-compression" algorithm that can achieve much higher compression rates than previous quantization work (even using just 1 bit per weight) with negligible degradation.
A Survey on Methods and Theories of Quantized Neural Networks
TLDR: A thorough review is given of different aspects of quantized neural networks, which are recognized as one of the most effective approaches to satisfying the extreme memory requirements that deep neural network models demand.
Exploiting Linear Structure Within Convolutional Networks for Efficient Evaluation
TLDR: Using large state-of-the-art models, this work demonstrates speedups of convolutional layers on both CPU and GPU by a factor of 2x, while keeping the accuracy within 1% of the original model.
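A minimal sketch of the underlying low-rank idea on a dense layer via a truncated SVD: the weight matrix is replaced by two thinner factors, trading a small reconstruction error for fewer multiply-adds. The rank is an illustrative choice, and the paper applies analogous low-rank decompositions to convolutional tensors rather than this plain matrix case.

```python
import numpy as np

W = np.random.randn(1024, 1024).astype(np.float32)
U, S, Vt = np.linalg.svd(W, full_matrices=False)
rank = 128
A = U[:, :rank] * S[:rank]                 # (1024, 128) factor
B = Vt[:rank]                              # (128, 1024) factor
x = np.random.randn(1024).astype(np.float32)
y_full, y_lowrank = W @ x, A @ (B @ x)     # ~1M vs ~0.26M multiply-adds per input
print(np.linalg.norm(y_full - y_lowrank) / np.linalg.norm(y_full))
```

On a random matrix the relative error stays large because the spectrum is flat; trained weight matrices typically have faster-decaying singular values, which is what makes the truncation worthwhile in practice.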
Learning Transferable Architectures for Scalable Image Recognition
TLDR: This paper proposes to search for an architectural building block on a small dataset and then transfer the block to a larger dataset, and introduces a new regularization technique called ScheduledDropPath that significantly improves generalization in the NASNet models.