Corpus ID: 8979489

Model compression as constrained optimization, with application to neural nets. Part II: quantization

@article{CarreiraPerpin2017ModelCA,
  title={Model compression as constrained optimization, with application to neural nets. Part II: quantization},
  author={Miguel {\'A}. Carreira-Perpi{\~n}{\'a}n and Yerlan Idelbayev},
  journal={ArXiv},
  year={2017},
  volume={abs/1707.04319}
}
We consider the problem of deep neural net compression by quantization: given a large, reference net, we want to quantize its real-valued weights using a codebook with $K$ entries so that the training loss of the quantized net is minimal. The codebook can be optimally learned jointly with the net, or fixed, as for binarization or ternarization approaches. Previous work has quantized the weights of the reference net, or incorporated rounding operations in the backpropagation algorithm, but this… 
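
As a concrete illustration of the setup in the abstract, the sketch below quantizes a weight array against a $K$-entry codebook learned by plain 1-D k-means on the weights. This is only the "quantize the reference weights directly" baseline that the abstract contrasts with, not the paper's loss-aware algorithm; the function name and NumPy details are illustrative assumptions.

```python
import numpy as np

def kmeans_codebook_quantize(weights, K=4, iters=20, seed=0):
    """Illustrative sketch: quantize weights against a K-entry codebook learned
    by 1-D k-means on the weights alone (i.e. minimizing quantization error,
    not the training loss of the quantized net)."""
    w = weights.ravel()
    rng = np.random.default_rng(seed)
    codebook = rng.choice(w, size=K, replace=False)           # initialize the K entries
    for _ in range(iters):
        # Assignment: map every weight to its nearest codebook entry.
        assign = np.argmin(np.abs(w[:, None] - codebook[None, :]), axis=1)
        # Update: each entry becomes the mean of the weights assigned to it.
        for k in range(K):
            if np.any(assign == k):
                codebook[k] = w[assign == k].mean()
    return codebook[assign].reshape(weights.shape), codebook

w_q, C = kmeans_codebook_quantize(np.random.randn(256, 128).astype(np.float32), K=4)
```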

Model compression as constrained optimization, with application to neural nets. Part I: general framework

This work gives a general formulation of model compression as constrained optimization, and presents separately in several companion papers the development of this general framework into specific algorithms for model compression based on quantization, pruning and other variations, including experimental results on compressing neural nets and other models.

Optimal Neural Net Compression via Constrained Optimization

A general algorithm for this nonconvex problem, based on a penalty function (quadratic penalty or augmented Lagrangian) and alternating optimization, yields a “learning-compression” algorithm that is simple to implement in existing deep learning toolboxes and efficient, with a runtime comparable to that of training the reference model in the first place.
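
To make the alternation concrete, here is a minimal PyTorch-flavored sketch of the quadratic-penalty variant: an L step that trains the weights under a penalty (mu/2)*||w - theta||^2 pulling them toward their quantized values theta, alternated with a C step that re-quantizes the current weights with a K-entry codebook. The mu schedule, learning rate, helper names, and the (input, target) data loader are assumptions for illustration; the augmented-Lagrangian variant would additionally carry a multiplier estimate.

```python
import torch

def c_step(w_flat, K, iters=10):
    """C step sketch: 1-D k-means codebook quantization of the current weights."""
    codebook = w_flat[torch.randperm(w_flat.numel(), device=w_flat.device)[:K]].clone()
    for _ in range(iters):
        assign = torch.argmin((w_flat[:, None] - codebook[None, :]).abs(), dim=1)
        for k in range(K):
            if (assign == k).any():
                codebook[k] = w_flat[assign == k].mean()
    return codebook[assign]

def lc_quantize(model, loss_fn, loader, K=4, mus=(1e-3, 1e-2, 1e-1), lr=1e-2):
    """Learning-compression sketch (quadratic-penalty version): alternate an
    L step (SGD on task loss + mu/2 * ||w - theta||^2) with a C step
    (theta = quantized w), increasing mu so the weights move onto the codebook."""
    params = [p for p in model.parameters() if p.requires_grad]
    theta = [c_step(p.detach().flatten(), K).view_as(p) for p in params]
    for mu in mus:
        opt = torch.optim.SGD(params, lr=lr)
        for x, y in loader:                                              # L step
            opt.zero_grad()
            penalty = sum(0.5 * mu * ((p - t) ** 2).sum() for p, t in zip(params, theta))
            (loss_fn(model(x), y) + penalty).backward()
            opt.step()
        theta = [c_step(p.detach().flatten(), K).view_as(p) for p in params]  # C step
    with torch.no_grad():                                                # deliver the quantized net
        for p, t in zip(params, theta):
            p.copy_(t)
    return model
```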

Training with Quantization Noise for Extreme Model Compression

This paper proposes to only quantize a different random subset of weights during each forward, allowing for unbiased gradients to flow through the other weights, establishing new state-of-the-art compromises between accuracy and model size both in natural language processing and image classification.

Training with Quantization Noise for Extreme Fixed-Point Compression

This paper proposes to only quantize a different random subset of weights during each forward, allowing for unbiased gradients to flow through the other weights, establishing new state-of-the-art compromises between accuracy and model size both in natural language processing and image classification.
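
A minimal sketch of the random-subset idea described above, assuming PyTorch, scalar (codebook) quantization, and a straight-through estimator on the quantized subset; the function name and the rate p are illustrative.

```python
import torch

def quant_noise_forward(weight, codebook, p=0.1):
    """Sketch: quantize only a random fraction p of the weights on this forward
    pass (straight-through for those), and leave the rest full precision so
    their gradients are exact."""
    assign = torch.argmin((weight.detach()[..., None] - codebook).abs(), dim=-1)
    w_q = codebook[assign]                                  # fully quantized copy
    mask = (torch.rand_like(weight) < p).float()            # subset to quantize this pass
    # Forward uses w_q on the masked entries; backward treats them as identity (STE).
    return weight + mask * (w_q - weight).detach()

codebook = torch.tensor([-0.5, 0.0, 0.5])
w = torch.randn(64, 64, requires_grad=True)
out = quant_noise_forward(w, codebook, p=0.2)
out.sum().backward()        # gradients reach every entry of w
```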

BinaryRelax: A Relaxation Approach For Training Deep Neural Networks With Quantized Weights

BinaryRelax is proposed, a simple two-phase algorithm for training deep neural networks with quantized weights that relaxes the hard quantization constraint into a continuous regularizer via the Moreau envelope, which turns out to be the squared Euclidean distance to the set of quantized weights.
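
The "squared Euclidean distance to the set of quantized weights" mentioned above has a simple closed form when the quantization is entrywise: project each weight to its nearest quantized value and sum the squared gaps. The PyTorch sketch below uses a fixed codebook and illustrative names (the paper's quantized set and training schedule differ); the regularizer would be added to the task loss with a weight that is gradually increased.

```python
import torch

def dist_sq_to_quantized(weight, codebook):
    """Squared Euclidean distance from `weight` to the set of quantized weights,
    i.e. ||w - proj(w)||^2 with proj(w) the entrywise nearest codebook value."""
    assign = torch.argmin((weight.detach()[..., None] - codebook).abs(), dim=-1)
    w_q = codebook[assign]
    return ((weight - w_q) ** 2).sum()

# Relaxed objective (sketch): loss = task_loss + lam * dist_sq_to_quantized(w, codebook),
# with lam ramped up during training so the weights are gradually pulled onto the codebook.
```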

PROXQUANT: QUANTIZED NEURAL NETWORKS VIA PROXIMAL OPERATORS

A more principled alternative approach, called PROXQUANT, is proposed that formulates quantized network training as a regularized learning problem instead and optimizes it via the prox-gradient method, outperforming state-of-the-art results on binary quantization and matching them on multi-bit quantization.

Mirror Descent View for Neural Network Quantization

By interpreting the continuous (unconstrained) parameters as the dual of the quantized ones, a Mirror Descent (MD) framework for NN quantization is introduced, and conditions on the projections are provided that yield valid mirror maps and, in turn, the respective MD updates.
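
As a toy illustration of the dual view described above (my own minimal example, not the paper's full framework): keep an unconstrained dual variable u, evaluate the task gradient at the constrained primal w = tanh(u), and take the gradient step on u. For weights constrained to (-1, 1), tanh is the inverse of a valid mirror map, so this update is a mirror-descent step.

```python
import numpy as np

def mirror_descent_step(u, grad_w, lr):
    """One MD step for weights in (-1, 1): the unconstrained dual u is updated
    with the gradient evaluated at the primal w = tanh(u); tanh plays the role
    of the inverse mirror map."""
    u_new = u - lr * grad_w          # gradient step in the dual space
    return u_new, np.tanh(u_new)     # map back to the constrained primal weights

# e.g. u, w = mirror_descent_step(u, dL_dw(w), lr=0.1), iterated during training;
# a hard binarization (sign of w) would be applied at the end.
```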

An Empirical Comparison of Quantization, Pruning and Low-rank Neural Network Compression using the LC Toolkit

The choice of compression is strongly model-dependent: for example, VGG16 is better compressed with pruning, while quantization is more suitable for the ResNets, which underlines the need for a common benchmark of compression schemes with fair and objective comparisons of the models of interest.

ProxQuant: Quantized Neural Networks via Proximal Operators

This work proposes a more principled alternative approach, called ProxQuant, that formulates quantized network training as a regularized learning problem instead and optimizes it via the prox-gradient method, challenging the indispensability of the straight-through gradient method and providing a powerful alternative.
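
A minimal NumPy sketch of the prox-gradient update for the binary case, assuming the simple per-entry regularizer lam * |w - sign(w)| (an illustrative choice in the spirit of the paper's quantization regularizers): its prox moves each weight at most lam toward its nearest binary value, landing exactly on +/-1 once within lam.

```python
import numpy as np

def prox_binary(v, lam):
    """Entrywise prox of lam * |x - sign(v)|: move each entry at most lam
    toward its nearest binary value (landing exactly on +/-1 once within lam)."""
    target = np.where(v >= 0, 1.0, -1.0)
    return v + np.clip(target - v, -lam, lam)

def proxquant_step(w, grad, lr, lam):
    """One prox-gradient update: gradient step on the task loss, then the prox
    of the scaled quantization regularizer; lam is typically ramped up so the
    weights end up exactly binary."""
    return prox_binary(w - lr * grad, lam)
```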

References

Showing 1-10 of 34 references

Model compression as constrained optimization, with application to neural nets. Part I: general framework

This work gives a general formulation of model compression as constrained optimization, and presents separately in several companion papers the development of this general framework into specific algorithms for model compression based on quantization, pruning and other variations, including experimental results on compressing neural nets and other models.

Quantized Neural Networks: Training Neural Networks with Low Precision Weights and Activations

A binary matrix multiplication GPU kernel is programmed that runs the MNIST QNN 7 times faster than an unoptimized GPU kernel, without any loss in classification accuracy.

Trained Ternary Quantization

This work proposes Trained Ternary Quantization (TTQ), a method that reduces the precision of neural-network weights to ternary values while improving the accuracy of some models (32-, 44- and 56-layer ResNets on CIFAR-10, and AlexNet on ImageNet).

Compressing Deep Convolutional Networks using Vector Quantization

Using a state-of-the-art CNN, this paper achieves 16-24 times compression of the network with only a 1% loss of classification accuracy, and finds that, for compressing the most storage-demanding densely connected layers, vector quantization methods have a clear gain over existing matrix factorization methods.

Dropout: a simple way to prevent neural networks from overfitting

It is shown that dropout improves the performance of neural networks on supervised learning tasks in vision, speech recognition, document classification and computational biology, obtaining state-of-the-art results on many benchmark data sets.

Soft Weight-Sharing for Neural Network Compression

This paper shows that competitive compression rates can be achieved by using a version of  “soft weight-sharing” (Nowlan & Hinton, 1992) and achieves both quantization and pruning in one simple (re-)training procedure, exposing the relation between compression and the minimum description length (MDL) principle.

Learning both Weights and Connections for Efficient Neural Network

A method is presented to reduce the storage and computation required by neural networks by an order of magnitude without affecting their accuracy, by learning only the important connections and pruning redundant connections with a three-step method.
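
A minimal PyTorch sketch of the magnitude-based prune step in that three-step (train, prune, retrain) scheme; the sparsity level, the choice to skip 1-D parameters, and the helper name are assumptions for illustration.

```python
import torch

def magnitude_prune(model, sparsity=0.9):
    """Prune step sketch: zero out the smallest-magnitude weights of each weight
    tensor and return binary masks so retraining can keep the pruned connections
    at zero (e.g. by re-masking the weights after every update)."""
    masks = {}
    for name, p in model.named_parameters():
        if p.dim() < 2:                        # skip biases / norm parameters here
            continue
        k = int(p.numel() * sparsity)
        if k == 0:
            continue
        threshold = p.detach().abs().flatten().kthvalue(k).values
        masks[name] = (p.detach().abs() > threshold).float()
        with torch.no_grad():
            p.mul_(masks[name])
    return masks
```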

XNOR-Net: ImageNet Classification Using Binary Convolutional Neural Networks

The Binary-Weight-Network version of AlexNet is compared with recent network binarization methods, BinaryConnect and BinaryNets, and outperforms these methods by large margins on ImageNet, by more than 16% in top-1 accuracy.
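
For the binary-weight case mentioned above, the standard approximation has a closed form: W ≈ alpha * sign(W), with the least-squares scale alpha equal to the mean absolute weight. The NumPy sketch below uses a single per-tensor scale for simplicity; finer-grained (e.g. per-filter) scales are the usual refinement.

```python
import numpy as np

def binary_weight_approx(W):
    """Binary-weight approximation: W ~= alpha * B with B = sign(W) and the
    best least-squares scale alpha = mean(|W|), computed per tensor here."""
    B = np.where(W >= 0, 1.0, -1.0)
    alpha = np.abs(W).mean()
    return alpha * B, alpha
```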

Fixed-point feedforward deep neural network design using weights +1, 0, and −1

The designed fixed-point networks with ternary weights (+1, 0, and -1) and 3-bit signals show only negligible performance loss compared to their floating-point counterparts.

Simplifying Neural Networks by Soft Weight-Sharing

A more complicated penalty term is proposed in which the distribution of weight values is modeled as a mixture of multiple Gaussians, which allows the parameters of the mixture model to adapt at the same time as the network learns.
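
To make the mixture penalty concrete, here is a small NumPy sketch of the negative log-likelihood of the weights under a K-component Gaussian mixture; in soft weight-sharing this term is added to the training loss and the mixture's means, variances and mixing proportions are learned together with the network. The parameterizations (log-sigmas, unnormalized log mixing weights) and names are illustrative choices, not the papers' exact formulation.

```python
import numpy as np

def mog_weight_penalty(w, means, log_sigmas, logit_pis):
    """Negative log-likelihood of the (flattened) weights under a Gaussian
    mixture: -sum_i log sum_k pi_k * N(w_i | mu_k, sigma_k^2)."""
    sigmas = np.exp(log_sigmas)
    log_pis = logit_pis - np.log(np.exp(logit_pis).sum())          # softmax in log space
    log_comp = (log_pis[None, :]
                - 0.5 * np.log(2 * np.pi) - np.log(sigmas)[None, :]
                - 0.5 * ((w[:, None] - means[None, :]) / sigmas[None, :]) ** 2)
    m = log_comp.max(axis=1, keepdims=True)                         # stable log-sum-exp
    log_p = m[:, 0] + np.log(np.exp(log_comp - m).sum(axis=1))
    return -log_p.sum()

w = np.random.randn(1000)
penalty = mog_weight_penalty(w, means=np.array([-0.2, 0.0, 0.2]),
                             log_sigmas=np.log(np.array([0.05, 0.05, 0.05])),
                             logit_pis=np.zeros(3))
```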