# Model compression as constrained optimization, with application to neural nets. Part II: quantization

@article{CarreiraPerpin2017ModelCA, title={Model compression as constrained optimization, with application to neural nets. Part II: quantization}, author={Miguel {\'A}. Carreira-Perpi{\~n}{\'a}n and Yerlan Idelbayev}, journal={ArXiv}, year={2017}, volume={abs/1707.04319} }

We consider the problem of deep neural net compression by quantization: given a large, reference net, we want to quantize its real-valued weights using a codebook with $K$ entries so that the training loss of the quantized net is minimal. The codebook can be optimally learned jointly with the net, or fixed, as for binarization or ternarization approaches. Previous work has quantized the weights of the reference net, or incorporated rounding operations in the backpropagation algorithm, but this…
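As an illustration of what quantizing weights with a codebook of $K$ entries means, the following is a minimal sketch of one-dimensional k-means (Lloyd's algorithm) applied to a fixed list of weights. It is not the paper's full algorithm, which also optimizes the training loss of the net; the function name `quantize_weights`, the evenly spaced initialization, and the `iters` parameter are assumptions made for this example.

```python
def quantize_weights(weights, K, iters=50):
    """1D k-means (Lloyd's algorithm): returns (codebook, assignments).

    Hypothetical helper for illustration only; it quantizes a fixed list of
    weights and does not touch any training loss.
    """
    lo, hi = min(weights), max(weights)
    # Assumed initialization: spread the K entries evenly over the weight range.
    if K == 1:
        codebook = [sum(weights) / len(weights)]
    else:
        codebook = [lo + (hi - lo) * k / (K - 1) for k in range(K)]

    assign = None
    for _ in range(iters):
        # Assignment step: each weight picks its nearest codebook entry.
        new_assign = [
            min(range(K), key=lambda k: (w - codebook[k]) ** 2) for w in weights
        ]
        # Update step: each entry becomes the mean of the weights assigned to it.
        for k in range(K):
            members = [w for w, a in zip(weights, new_assign) if a == k]
            if members:  # an empty cluster keeps its previous value
                codebook[k] = sum(members) / len(members)
        if new_assign == assign:  # assignments stopped changing: converged
            break
        assign = new_assign
    return codebook, assign


weights = [-1.1, -0.9, 0.05, -0.05, 0.9, 1.1]
codebook, assign = quantize_weights(weights, K=3)
# Each weight is replaced by the codebook entry it was assigned to.
quantized = [codebook[a] for a in assign]
```

Storing `assign` (an index of about $\log_2 K$ bits per weight) together with the small `codebook` is what yields the compression; a fixed codebook such as $\{-1, +1\}$ would skip the update step.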

## 34 Citations

### Model compression as constrained optimization, with application to neural nets. Part I: general framework

- Computer Science, ArXiv
- 2017

This work gives a general formulation of model compression as constrained optimization, and presents separately in several companion papers the development of this general framework into specific algorithms for model compression based on quantization, pruning and other variations, including experimental results on compressing neural nets and other models.

### Optimal Neural Net Compression via Constrained Optimization

- Computer Science
- 2018

A general algorithm to optimize this nonconvex problem based on a penalty function (quadratic penalty or augmented Lagrangian) and alternating optimization results in a “learning-compression” algorithm, which is simple to implement in existing deep learning toolboxes and efficient, with a runtime comparable to that of training a reference model in the first place.
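The quadratic-penalty version of this scheme can be sketched as follows. The notation is an assumption for illustration and does not appear in this listing: $L$ is the training loss, $\mathbf{w}$ the real-valued weights, $\Theta$ the low-dimensional compressed parameters, $\Delta$ the decompression mapping, and $\mu$ the penalty parameter.

$$
\min_{\mathbf{w},\,\Theta} \; L(\mathbf{w}) \quad \text{s.t.} \quad \mathbf{w} = \Delta(\Theta)
\qquad\Longrightarrow\qquad
Q(\mathbf{w}, \Theta; \mu) = L(\mathbf{w}) + \frac{\mu}{2}\,\bigl\|\mathbf{w} - \Delta(\Theta)\bigr\|^2 .
$$

Alternating optimization of $Q$ while driving $\mu \to \infty$ gives the two steps:

$$
\text{L step:}\;\; \min_{\mathbf{w}} \; L(\mathbf{w}) + \frac{\mu}{2}\,\bigl\|\mathbf{w} - \Delta(\Theta)\bigr\|^2 ,
\qquad
\text{C step:}\;\; \min_{\Theta} \; \bigl\|\mathbf{w} - \Delta(\Theta)\bigr\|^2 .
$$

The L step is ordinary training with a quadratic regularizer, and the C step is a compression of the current weights that does not involve the loss.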

### Training with Quantization Noise for Extreme Model Compression

- Computer Science, ICLR
- 2021

This paper proposes quantizing only a different random subset of weights during each forward pass, which lets unbiased gradients flow through the remaining weights, and establishes new state-of-the-art trade-offs between accuracy and model size in both natural language processing and image classification.

### Training with Quantization Noise for Extreme Fixed-Point Compression

- Computer Science
- 2020

This paper proposes quantizing only a different random subset of weights during each forward pass, which lets unbiased gradients flow through the remaining weights, and establishes new state-of-the-art trade-offs between accuracy and model size in both natural language processing and image classification.

### BinaryRelax: A Relaxation Approach For Training Deep Neural Networks With Quantized Weights

- Computer Science, SIAM J. Imaging Sci.
- 2018

BinaryRelax is proposed, a simple two-phase algorithm for training deep neural networks with quantized weights that relaxes the hard constraint into a continuous regularizer via the Moreau envelope, which turns out to be the squared Euclidean distance to the set of quantized weights.
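The distance interpretation follows directly from the definition of the Moreau envelope. As a short derivation under assumed notation (not taken from this listing), let $\mathcal{Q}$ be the set of quantized weight vectors, $\delta_{\mathcal{Q}}$ its indicator function (zero on $\mathcal{Q}$, $+\infty$ elsewhere), and $\lambda > 0$ a smoothing parameter:

$$
\inf_{\mathbf{y}} \Bigl\{ \delta_{\mathcal{Q}}(\mathbf{y}) + \frac{1}{2\lambda}\,\|\mathbf{x} - \mathbf{y}\|^2 \Bigr\}
= \frac{1}{2\lambda}\,\min_{\mathbf{y} \in \mathcal{Q}} \|\mathbf{x} - \mathbf{y}\|^2
= \frac{1}{2\lambda}\,\operatorname{dist}(\mathbf{x}, \mathcal{Q})^2 .
$$

The indicator restricts the infimum to $\mathbf{y} \in \mathcal{Q}$, so what remains is the scaled squared distance from $\mathbf{x}$ to its nearest quantized point.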

### EXTREME MODEL COMPRESSION

- Computer Science
- 2021

The proposal is to quantize only a different random subset of weights during each forward pass, which lets unbiased gradients flow through the remaining weights, and establishes new state-of-the-art trade-offs between accuracy and model size in both natural language processing and image classification.

### PROXQUANT: QUANTIZED NEURAL NETWORKS VIA

- Computer Science
- 2018

A more principled alternative approach, called PROXQUANT, is proposed that formulates quantized network training as a regularized learning problem instead and optimizes it via the prox-gradient method, which outperforms state-of-the-art results on binary quantization and is on par on multi-bit quantization.

### Mirror Descent View for Neural Network Quantization

- Computer Science, AISTATS
- 2021

By interpreting the continuous parameters (unconstrained) as the dual of the quantized ones, a Mirror Descent (MD) framework for NN quantization is introduced and conditions on the projections are provided which would enable us to derive valid mirror maps and in turn the respective MD updates.

### An Empirical Comparison of Quantization, Pruning and Low-rank Neural Network Compression using the LC Toolkit

- Computer Science, 2021 International Joint Conference on Neural Networks (IJCNN)
- 2021

The choice of compression is strongly model-dependent: for example, VGG16 is better compressed with pruning, while quantization is more suitable for the ResNets, which underlines the need for a common benchmark of compression schemes with fair and objective comparisons of the models of interest.

### ProxQuant: Quantized Neural Networks via Proximal Operators

- Computer Science, ICLR
- 2019

This work proposes a more principled alternative approach, called ProxQuant, that formulates quantized network training as a regularized learning problem instead and optimizes it via the prox-gradient method, challenging the indispensability of the straight-through gradient method and providing a powerful alternative.

## References


### Model compression as constrained optimization, with application to neural nets. Part I: general framework

- Computer Science, ArXiv
- 2017

This work gives a general formulation of model compression as constrained optimization, and presents separately in several companion papers the development of this general framework into specific algorithms for model compression based on quantization, pruning and other variations, including experimental results on compressing neural nets and other models.

### Quantized Neural Networks: Training Neural Networks with Low Precision Weights and Activations

- Computer Science, J. Mach. Learn. Res.
- 2017

A binary matrix multiplication GPU kernel is programmed with which it is possible to run the MNIST QNN 7 times faster than with an unoptimized GPU kernel, without suffering any loss in classification accuracy.

### Trained Ternary Quantization

- Computer Science, ICLR
- 2017

This work proposes Trained Ternary Quantization (TTQ), a method that reduces the precision of weights in neural networks to ternary values and can even improve the accuracy of some models (32-, 44-, and 56-layer ResNets on CIFAR-10, and AlexNet on ImageNet).

### Compressing Deep Convolutional Networks using Vector Quantization

- Computer Science, ArXiv
- 2014

This paper achieves 16-24 times compression of the network with only 1% loss of classification accuracy using a state-of-the-art CNN, and finds that, for compressing the most storage-demanding densely connected layers, vector quantization methods have a clear gain over existing matrix factorization methods.

### Dropout: a simple way to prevent neural networks from overfitting

- Computer Science, J. Mach. Learn. Res.
- 2014

It is shown that dropout improves the performance of neural networks on supervised learning tasks in vision, speech recognition, document classification and computational biology, obtaining state-of-the-art results on many benchmark data sets.

### Soft Weight-Sharing for Neural Network Compression

- Computer Science, ICLR
- 2017

This paper shows that competitive compression rates can be achieved by using a version of “soft weight-sharing” (Nowlan & Hinton, 1992) and achieves both quantization and pruning in one simple (re-)training procedure, exposing the relation between compression and the minimum description length (MDL) principle.

### Learning both Weights and Connections for Efficient Neural Network

- Computer Science, NIPS
- 2015

A method is described that reduces the storage and computation required by neural networks by an order of magnitude without affecting their accuracy: it learns only the important connections and prunes redundant ones using a three-step method.

### XNOR-Net: ImageNet Classification Using Binary Convolutional Neural Networks

- Computer Science, ECCV
- 2016

The Binary-Weight-Network version of AlexNet is compared with recent network binarization methods, BinaryConnect and BinaryNets, and outperforms these methods by large margins on ImageNet, by more than 16% in top-1 accuracy.

### Fixed-point feedforward deep neural network design using weights +1, 0, and −1

- Computer Science, 2014 IEEE Workshop on Signal Processing Systems (SiPS)
- 2014

The designed fixed-point networks with ternary weights (+1, 0, and −1) and 3-bit signal show only negligible performance loss when compared to the floating-point counterparts.

### Simplifying Neural Networks by Soft Weight-Sharing

- Computer Science, Neural Computation
- 1992

A more complicated penalty term is proposed in which the distribution of weight values is modeled as a mixture of multiple Gaussians, which allows the parameters of the mixture model to adapt at the same time as the network learns.