Quantization and Training of Neural Networks for Efficient Integer-Arithmetic-Only Inference

@article{Jacob2017QuantizationAT,
  title={Quantization and Training of Neural Networks for Efficient Integer-Arithmetic-Only Inference},
  author={Benoit Jacob and Skirmantas Kligys and Bo Chen and Menglong Zhu and Matthew Tang and Andrew G. Howard and Hartwig Adam and Dmitry Kalenichenko},
  journal={2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  year={2018},
  pages={2704-2713}
}
The rising popularity of intelligent mobile devices and the daunting computational cost of deep learning-based models call for efficient and accurate on-device inference schemes. We propose a quantization scheme that allows inference to be carried out using integer-only arithmetic, which can be implemented more efficiently than floating-point inference on commonly available integer-only hardware. We also co-design a training procedure to preserve end-to-end model accuracy post quantization. As…
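A minimal NumPy sketch of the affine scheme the abstract describes: real values are represented as r ≈ S·(q − Z) with an integer q, a float scale S, and an integer zero-point Z, and a matrix multiply accumulates in int32 before being rescaled into the output's quantized domain. Function and parameter names here are illustrative; on actual integer-only hardware the float multiplier M is replaced by a fixed-point multiply-and-shift.

```python
import numpy as np

def quantize(r, scale, zero_point, qmin=0, qmax=255):
    """Affine quantization: r ~ scale * (q - zero_point), stored as uint8."""
    q = np.round(r / scale) + zero_point
    return np.clip(q, qmin, qmax).astype(np.uint8)

def dequantize(q, scale, zero_point):
    return scale * (q.astype(np.int32) - zero_point)

def quantized_matmul(qa, sa, za, qb, sb, zb, sc, zc):
    """Integer-only matmul: subtract zero-points, accumulate in int32, then
    rescale by M = sa*sb/sc into the output's quantized domain."""
    acc = (qa.astype(np.int32) - za) @ (qb.astype(np.int32) - zb)
    M = (sa * sb) / sc          # becomes a fixed-point multiply-and-shift in hardware
    qc = np.round(M * acc) + zc
    return np.clip(qc, 0, 255).astype(np.uint8)
```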

Integer-Only Neural Network Quantization Scheme Based on Shift-Batch-Normalization

This scheme combines shift-based batch normalization and uniform quantization in a single layer to implement 4-bit integer-only inference; it achieves good power and latency efficiency and is especially well suited to deployment on co-designed hardware platforms.
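The paper's exact shift-based formulation is not reproduced in the summary above, so the following is only a hedged sketch of the general idea: round the batch-norm scale to the nearest power of two so that the per-channel multiplication becomes a bit shift on integer hardware. All names are assumptions.

```python
import numpy as np

def shift_batch_norm(x, gamma, beta, mean, var, eps=1e-5):
    """Illustrative shift-style batch norm: the per-channel scale
    gamma / sqrt(var + eps) is rounded to the nearest power of two, so the
    multiplication reduces to a shift in an integer datapath. Assumed
    formulation, not the paper's exact one."""
    scale = gamma / np.sqrt(var + eps)
    shift = np.round(np.log2(np.maximum(np.abs(scale), 1e-12)))  # power-of-two exponent
    approx = np.sign(scale) * 2.0 ** shift                       # 2**shift -> hardware bit shift
    return approx * (x - mean) + beta
```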

Integer Quantization for Deep Learning Inference: Principles and Empirical Evaluation

This paper presents a workflow for 8-bit quantization that is able to maintain accuracy within 1% of the floating-point baseline on all networks studied, including models that are more difficult to quantize, such as MobileNets and BERT-large.
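A hedged sketch of the kind of post-training step such a workflow builds on: symmetric int8 quantization with max calibration, per tensor or per channel. The axis handling is an assumption for illustration; the paper also evaluates other calibration methods.

```python
import numpy as np

def symmetric_int8_quantize(x, per_channel_axis=None):
    """Symmetric int8 quantization with max calibration: the scale maps the
    largest magnitude to 127. per_channel_axis=None gives one per-tensor
    scale (typical for activations); an integer axis gives per-channel
    scales (typical for weights)."""
    if per_channel_axis is None:
        amax = np.max(np.abs(x))
    else:
        reduce_axes = tuple(i for i in range(x.ndim) if i != per_channel_axis)
        amax = np.max(np.abs(x), axis=reduce_axes, keepdims=True)
    scale = amax / 127.0
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale
```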

A Hybrid Asymmetric Integer-Only Quantization Method of Neural Networks for Efficient Inference

  • Wei Lu, Ma Zhong, Yang Chaojie
  • Computer Science
    2022 5th International Conference on Pattern Recognition and Artificial Intelligence (PRAI)
  • 2022
An efficient hybrid asymmetric integer-only quantization method for different types of neural network layers is proposed; it resolves the contradiction between quantization accuracy and ease of implementation, balances the trade-off between clipping range and quantization resolution, and thus improves the accuracy of the quantized neural network.
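To make the clipping-range versus resolution trade-off concrete, here is a small asymmetric-quantization sketch in which a percentile clip bounds the range. It illustrates the trade-off only and is not the paper's hybrid method; the percentile knob and names are assumptions.

```python
import numpy as np

def asymmetric_quantize(x, num_bits=8, clip_percentile=99.9):
    """Affine (asymmetric) quantization over a clipped range: widening the
    clip reduces clipping error but coarsens the step size. Illustrative
    only, not the paper's hybrid scheme."""
    lo = np.percentile(x, 100.0 - clip_percentile)
    hi = np.percentile(x, clip_percentile)
    qmax = 2 ** num_bits - 1
    scale = max((hi - lo) / qmax, 1e-12)
    zero_point = int(np.round(-lo / scale))
    q = np.clip(np.round(x / scale) + zero_point, 0, qmax).astype(np.uint8)
    return q, scale, zero_point
```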

Low-bit Quantization of Neural Networks for Efficient Inference

This paper formalizes the linear quantization task as a Minimum Mean Squared Error (MMSE) problem for both weights and activations, allowing low-bit precision inference without the need for full network retraining.
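The MMSE formulation can be approximated with a simple search over symmetric clipping thresholds, picking the one whose uniform quantizer minimizes reconstruction error. A brute-force sketch under that assumption; the paper's actual solver may differ.

```python
import numpy as np

def mmse_scale(x, num_bits=4, grid=100):
    """Pick the symmetric clipping threshold whose uniform quantizer minimizes
    the mean squared error ||x - dequant(quant(x))||^2."""
    qmax = 2 ** (num_bits - 1) - 1
    amax = np.max(np.abs(x))
    best_scale, best_err = None, np.inf
    for t in np.linspace(amax / grid, amax, grid):
        scale = t / qmax
        q = np.clip(np.round(x / scale), -qmax - 1, qmax)
        err = np.mean((x - q * scale) ** 2)
        if err < best_err:
            best_scale, best_err = scale, err
    return best_scale
```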

Bit Efficient Quantization for Deep Neural Networks

This paper presents a comparison of model-parameter-driven quantization approaches that can reach precision as low as 3 bits without affecting accuracy, and shows methods for lowering bit precision beyond the usual quantization limits via object-class clustering.

Self-Supervised Quantization of Pre-Trained Neural Networks for Multiplierless Acceleration

A novel 8-bit linear quantization procedure for the parameters and activations of pre-trained neural networks achieves performance close to that of the original network without retraining, and consequently requires no labeled training data.

A 4-bit Integer-Only Neural Network Quantization Method Based on Shift Batch Normalization

This paper proposes an integer-only quantization method that requires no division or large-integer multiplication, making it well suited to co-designed hardware platforms; the method is deployed under the OpenCL framework and on a flash-based in-memory-computing chip to verify its feasibility.

Deep Learning Optimization for Edge Devices: Analysis of Training Quantization Parameters

An in-depth analysis of quantization-aware training parameters is performed, examining how the simulation of precision loss in the forward pass (by quantizing and dequantizing tensors) and the locations where this simulation is inserted affect the accuracy of deep neural networks intended for efficient computation on resource-constrained devices.
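The quantize-dequantize step referred to above is commonly called fake quantization; a minimal sketch of how it simulates precision loss in the forward pass (the backward pass usually treats it as identity via the straight-through estimator):

```python
import numpy as np

def fake_quantize(x, scale, zero_point, qmin=0, qmax=255):
    """Quantize then immediately dequantize, so the forward pass sees the
    rounding and clipping error that real quantized inference would incur."""
    q = np.clip(np.round(x / scale) + zero_point, qmin, qmax)
    return scale * (q - zero_point)
```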

Fixed-point Quantization of Convolutional Neural Networks for Quantized Inference on Embedded Platforms

This paper proposes a method to optimally quantize the weights, biases, and activations of each layer of a pre-trained CNN while controlling the loss in inference accuracy, yielding a low-precision CNN with accuracy losses of less than 1%.
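A generic per-layer fixed-point sketch of the kind of choice involved: pick a Qm.n split so the layer's dynamic range fits into a fixed word length, trading fractional resolution for integer range. This is an assumed illustration, not the paper's exact optimization.

```python
import numpy as np

def to_fixed_point(x, word_length=8):
    """Choose a Qm.n split for a signed fixed-point word: m integer bits cover
    the tensor's dynamic range, the remaining n = word_length - 1 - m bits go
    to the fraction (one bit is reserved for the sign)."""
    m = max(0, int(np.ceil(np.log2(np.max(np.abs(x)) + 1e-12))))
    n = word_length - 1 - m
    q = np.clip(np.round(x * 2.0 ** n),
                -2 ** (word_length - 1), 2 ** (word_length - 1) - 1).astype(np.int32)
    return q, n            # represented value ~ q / 2**n
```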
...

References

SHOWING 1-10 OF 33 REFERENCES

Deep Learning with Limited Numerical Precision

The results show that deep networks can be trained using only a 16-bit-wide fixed-point number representation when stochastic rounding is used, with little to no degradation in classification accuracy.
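A short sketch of stochastic rounding onto a 16-bit fixed-point grid: the fractional remainder becomes the probability of rounding up, so rounding is unbiased in expectation, which is what lets small gradient updates survive the low-precision representation.

```python
import numpy as np

def stochastic_round_fixed(x, frac_bits=8, word_length=16):
    """Round to a fixed-point grid with probability proportional to proximity:
    a value 0.3 of a step above the lower level rounds up 30% of the time,
    so E[rounded value] equals the input and no systematic bias is added."""
    rng = np.random.default_rng()
    scaled = np.asarray(x, dtype=np.float64) * 2 ** frac_bits
    lower = np.floor(scaled)
    q = lower + (rng.random(scaled.shape) < (scaled - lower))
    qmax = 2 ** (word_length - 1) - 1
    return np.clip(q, -qmax - 1, qmax) / 2 ** frac_bits
```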

Extremely Low Bit Neural Network: Squeeze the Last Bit Out with ADMM

This paper focuses on compressing and accelerating deep models whose network weights are represented with very small numbers of bits, referred to as extremely low bit neural networks, and proposes to solve this problem using extragradient and iterative quantization algorithms that converge considerably faster than conventional optimization methods.

Trained Ternary Quantization

This work proposes Trained Ternary Quantization (TTQ), a method that reduces the precision of neural network weights to ternary values while even improving the accuracy of some models (32-, 44-, and 56-layer ResNets) on CIFAR-10 and of AlexNet on ImageNet.
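The forward ternarization step in TTQ can be sketched as thresholding against a fraction of the largest weight magnitude and mapping the survivors to two learned scales; in the original method wp and wn are trained by backpropagation, here they are simply passed in.

```python
import numpy as np

def ternarize(w, wp, wn, t=0.05):
    """TTQ-style forward ternarization: weights above the threshold map to a
    learned positive scale wp, weights below -threshold map to -wn, the rest
    to zero. The threshold is a fraction t of the largest magnitude."""
    delta = t * np.max(np.abs(w))
    tern = np.zeros_like(w)
    tern[w > delta] = wp
    tern[w < -delta] = -wn
    return tern
```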

Improving the speed of neural networks on CPUs

This paper uses speech recognition as an example task and shows that a real-time hybrid hidden Markov model / neural network (HMM/NN) large-vocabulary system can be built with a 10× speedup over an unoptimized baseline and a 4× speedup over an aggressively optimized floating-point baseline at no cost in accuracy.

Quantized Neural Networks: Training Neural Networks with Low Precision Weights and Activations

A binary matrix multiplication GPU kernel is programmed that runs the MNIST QNN 7 times faster than an unoptimized GPU kernel, without any loss in classification accuracy.
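The arithmetic such a binary kernel exploits: with {-1, +1} values packed into machine words (1 encodes +1, 0 encodes -1), a dot product reduces to an XOR plus a popcount via dot = n − 2·popcount(a XOR b). A tiny sketch of that identity; the bit packing of full matrices is elided.

```python
def binary_dot(a_bits, b_bits, n):
    """Dot product of two {-1,+1} vectors of length n packed as bits:
    matching bits contribute +1, differing bits -1, hence n - 2*popcount(xor)."""
    return n - 2 * bin(a_bits ^ b_bits).count("1")

# a = [+1, -1, +1, +1] -> 0b1011, b = [+1, +1, -1, +1] -> 0b1101
print(binary_dot(0b1011, 0b1101, 4))   # -> 0, the dot product of the two vectors
```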

ShuffleNet: An Extremely Efficient Convolutional Neural Network for Mobile Devices

An extremely computation-efficient CNN architecture named ShuffleNet is introduced, which is designed specially for mobile devices with very limited computing power (e.g., 10-150 MFLOPs), to greatly reduce computation cost while maintaining accuracy.

Deep Compression: Compressing Deep Neural Network with Pruning, Trained Quantization and Huffman Coding

This work introduces "deep compression", a three-stage pipeline of pruning, trained quantization, and Huffman coding that together reduce the storage requirements of neural networks by 35x to 49x without affecting their accuracy.
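The first stage of that pipeline is magnitude pruning; a minimal sketch (the later k-means weight-sharing and Huffman-coding stages, and the retraining between stages, are omitted):

```python
import numpy as np

def magnitude_prune(w, sparsity=0.9):
    """Zero out the smallest-magnitude weights so that roughly `sparsity` of
    the entries become zero; the surviving weights are what the later
    weight-sharing and Huffman-coding stages operate on."""
    threshold = np.quantile(np.abs(w), sparsity)
    return np.where(np.abs(w) >= threshold, w, 0.0)
```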

Compressing Deep Convolutional Networks using Vector Quantization

This paper achieves 16-24x compression of state-of-the-art CNNs with only 1% loss of classification accuracy, and finds that, for compressing the most storage-demanding densely connected layers, vector quantization methods have a clear advantage over existing matrix factorization methods.
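A plain scalar k-means sketch of the codebook idea behind vector quantization of dense layers: each weight is replaced by a small index into a shared codebook. The paper's stronger variants (e.g. product quantization of sub-vectors) are not reproduced here.

```python
import numpy as np

def kmeans_codebook(w, k=256, iters=20):
    """Scalar k-means over a weight matrix: returns a k-entry codebook and an
    index per weight, so storage drops from one float to one small index per
    weight plus the codebook."""
    flat = w.ravel()
    centers = np.linspace(flat.min(), flat.max(), k)        # linear init of the codebook
    idx = np.zeros(flat.size, dtype=np.int64)
    for _ in range(iters):
        idx = np.argmin(np.abs(flat[:, None] - centers[None, :]), axis=1)
        for j in range(k):
            members = flat[idx == j]
            if members.size:
                centers[j] = members.mean()
    return centers, idx.reshape(w.shape)
```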

XNOR-Net: ImageNet Classification Using Binary Convolutional Neural Networks

The Binary-Weight-Network version of AlexNet is compared with recent network binarization methods, BinaryConnect and BinaryNets, and outperforms these methods by large margins on ImageNet, by more than 16% in top-1 accuracy.
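The Binary-Weight-Network approximation can be summarized in two lines: W ≈ α·sign(W), with α the mean absolute value of the weights (computed per filter in the paper; per tensor here for brevity).

```python
import numpy as np

def binarize_weights(w):
    """Approximate real-valued weights by a single scale times their signs:
    alpha = mean(|W|), B = sign(W), so W ~ alpha * B."""
    alpha = np.mean(np.abs(w))
    return alpha, np.sign(w)
```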

DoReFa-Net: Training Low Bitwidth Convolutional Neural Networks with Low Bitwidth Gradients

DoReFa-Net, a method for training convolutional neural networks with low-bitwidth weights and activations using low-bitwidth parameter gradients, is proposed and can achieve prediction accuracy comparable to 32-bit counterparts.
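A sketch of the DoReFa weight quantizer: weights are squashed into [0, 1] with a tanh normalization, uniformly quantized to k bits, and mapped back to [-1, 1]; gradients through the rounding use the straight-through estimator during training.

```python
import numpy as np

def quantize_k(x, k):
    """Uniform quantizer for inputs in [0, 1] with 2**k levels."""
    n = 2 ** k - 1
    return np.round(x * n) / n

def dorefa_weights(w, k):
    """k-bit weight quantization following the DoReFa rule:
    w_q = 2 * quantize_k(tanh(w) / (2 * max|tanh(w)|) + 1/2) - 1."""
    t = np.tanh(w)
    return 2.0 * quantize_k(t / (2.0 * np.max(np.abs(t))) + 0.5, k) - 1.0
```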