Corpus ID: 227127479

HAWQV3: Dyadic Neural Network Quantization

@inproceedings{Yao2021HAWQV3DN,
  title={HAWQV3: Dyadic Neural Network Quantization},
  author={Zhewei Yao and Zhen Dong and Zhangcheng Zheng and Amir Gholami and Jiali Yu and Eric Tan and Leyuan Wang and Qijing Huang and Yida Wang and Michael W. Mahoney and Kurt Keutzer},
  booktitle={ICML},
  year={2021}
}
Quantization is one of the key techniques used to make Neural Networks (NNs) faster and more energy efficient. However, current low precision quantization algorithms often have the hidden cost of conversion back and forth from floating point to quantized integer values. This hidden cost limits the latency improvement realized by quantizing NNs. To address this, we present HAWQV3, a novel dyadic quantization framework. The contributions of HAWQV3 are the following. (i) The entire inference… 
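To make the dyadic idea concrete, the following is a minimal sketch (in Python/NumPy; the helper names, bit widths, and rounding are assumptions, not HAWQV3's actual implementation) of how a real-valued requantization factor can be replaced by an integer multiply followed by a bit shift, so that no floating-point operations remain at inference time:

import numpy as np

def dyadic_approx(scale, mult_bits=15):
    # Approximate a positive rescaling factor as a dyadic number b / 2^c:
    # choose c so that b = round(scale * 2^c) fits in `mult_bits` bits.
    # (Illustrative sketch; bit widths and rounding are assumptions.)
    assert scale > 0
    c = mult_bits - 1 - int(np.floor(np.log2(scale)))
    assert c >= 0, "typical requantization scales are < 1"
    b = int(round(scale * (1 << c)))
    return b, c

def requantize(acc_int32, b, c):
    # Rescale an INT32 accumulator to INT8 as out ~= acc * b / 2^c using
    # only an integer multiply, an arithmetic bit shift, and a clamp.
    out = (acc_int32.astype(np.int64) * b) >> c
    return np.clip(out, -128, 127).astype(np.int8)

# Example: effective scale S_w * S_x / S_out ~= 0.0123 (hypothetical value)
b, c = dyadic_approx(0.0123)
acc = np.array([1000, -2500, 40000], dtype=np.int32)  # hypothetical INT32 accumulators
print(requantize(acc, b, c))                          # [ 12 -31 127]

The shift-based rescaling floors rather than rounds to nearest; production kernels typically add a rounding offset before the shift.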

Citations

F8Net: Fixed-Point 8-bit Only Multiplication for Network Quantization
TLDR
This work presents F8Net, a novel quantization framework consisting of only fixed-point 8-bit multiplication, which achieves comparable or better performance than existing quantization techniques that rely on INT32 multiplication or floating-point arithmetic, and even than the full-precision counterparts, reaching state-of-the-art results.
Post-Training Quantization for Cross-Platform Learned Image Compression
TLDR
This work introduces well-developed post-training quantization to make model inference integer-arithmetic-only, which is much simpler than existing training- and fine-tuning-based approaches yet still retains the superior rate-distortion performance of learned image compression.
TinyM2Net: A Flexible System Algorithm Co-designed Multimodal Learning Framework for Tiny Devices
TLDR
TinyM2Net shows that even a tiny multimodal learning model can improve classification performance over unimodal frameworks, and is designed to be deployed on tiny devices.
OMPQ: Orthogonal Mixed Precision Quantization
TLDR
This work proposes to optimize a proxy metric, network orthogonality, which is highly correlated with the loss of the integer programming formulation yet easy to optimize with linear programming; this reduces the search time and the amount of required data by orders of magnitude with little compromise in quantization accuracy.
BMPQ: Bit-Gradient Sensitivity-Driven Mixed-Precision Quantization of DNNs from Scratch
TLDR
BMPQ is presented, a training method that uses bit gradients to analyze layer sensitivities and yield mixed-precision quantized models; it requires only a single training iteration and does not need a pre-trained baseline.
Quantization in Layer's Input is Matter
In this paper, we show that quantization of a layer's input is more important to the loss function than quantization of the parameters. The algorithm, which is based on the layer's input, …
SPDY: Accurate Pruning with Speedup Guarantees
TLDR
SPDY is a new compression method that automatically determines layer-wise sparsity targets achieving a desired inference speedup on a given system while minimizing accuracy loss, and is compatible with most existing pruning approaches.
Neural Network Quantization for Efficient Inference: A Survey
TLDR
This paper surveys the many neural network quantization techniques that have been developed in the last decade and proposes future directions of research in the area.
A Survey of Quantization Methods for Efficient Neural Network Inference
TLDR
This article surveys approaches to the problem of quantizing the numerical values in deep Neural Network computations, covering the advantages/disadvantages of current methods.
I-BERT: Integer-only BERT Quantization
TLDR
This work proposes a novel integer-only quantization scheme for Transformer based models that quantizes the entire inference process, and demonstrates how to approximate nonlinear operations in Transformer architectures, e.g., GELU, Softmax, and Layer Normalization, with lightweight integer computations.
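As a concrete illustration of the kind of integer-only kernel described in the I-BERT summary above, here is a sketch of a second-order polynomial a(x + b)^2 + c evaluated entirely with integer arithmetic on the quantized activation (the function name, interface, and rounding are assumptions, not the authors' code):

def int_poly2(q, S, a, b, c):
    # Evaluate y = a*(x + b)^2 + c for x = S * q using only integer
    # arithmetic on the activation q. The floating-point coefficients are
    # folded into integer constants and an output scale computed offline.
    # (Sketch in the spirit of I-BERT's polynomial kernel; details assumed.)
    q_b = int(b / S)              # shift term expressed in the input scale
    q_c = int(c / (a * S * S))    # offset term expressed in the output scale
    S_out = a * S * S             # output scale: y ~= S_out * q_out
    q_out = (q + q_b) ** 2 + q_c  # integer-only multiply/add at inference time
    return q_out, S_out

# Example with the coefficients reported for I-BERT's i-GELU erf approximation
# (treated here as illustrative): y = -0.2888*(x - 1.769)^2 + 1 at x = 1.85.
q_out, S_out = int_poly2(q=37, S=0.05, a=-0.2888, b=-1.769, c=1.0)
print(q_out * S_out)  # ~0.997, close to the floating-point value ~0.998

GELU, Softmax, and LayerNorm are then built from such polynomial pieces plus integer bit shifts and an integer square root, keeping the whole Transformer inference path in integer arithmetic.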

References

SHOWING 1-10 OF 71 REFERENCES
HAWQ-V2: Hessian Aware trace-Weighted Quantization of Neural Networks. In Advances in Neural Information Processing Systems, 2020.
Jacob, B., Kligys, S., Chen, B., Zhu, M., Tang, M., Howard, A., Adam, H., and Kalenichenko, D. Quantization and Training of Neural Networks for Efficient Integer-Arithmetic-Only Inference. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018.
HAQ: Hardware-Aware Automated Quantization
TLDR
This paper introduces the Hardware-Aware Automated Quantization (HAQ) framework, which leverages reinforcement learning to automatically determine the quantization policy and takes the hardware accelerator's feedback into the design loop to reduce latency and energy consumption.
HAWQ-V2: Hessian Aware trace-Weighted Quantization of Neural Networks
TLDR
A theoretical analysis shows that a better sensitivity metric is to compute the average of all of the Hessian eigenvalues, and a Pareto-frontier-based method is developed for selecting the exact bit precision of different layers without any manual selection (a sketch of estimating this average via Hessian-vector products follows the reference list).
TVM: An Automated End-to-End Optimizing Compiler for Deep Learning
TLDR
TVM is a compiler that exposes graph-level and operator-level optimizations to provide performance portability to deep learning workloads across diverse hardware back-ends and automates optimization of low-level programs to hardware characteristics by employing a novel, learning-based cost modeling method for rapid exploration of code optimizations.
Zero-Centered Fixed-Point Quantization With Iterative Retraining for Deep Convolutional Neural Network-Based Object Detectors
TLDR
In the proposed method, the center of the weight distribution is adjusted to zero by subtracting the mean of weight parameters before quantization, and the retraining process is iteratively applied to minimize the accuracy drop caused by quantization.
Bayesian Bits: Unifying Quantization and Pruning
TLDR
Bayesian Bits is introduced, a practical method for joint mixed-precision quantization and pruning through gradient-based optimization that learns pruned, mixed-precision networks providing a better trade-off between accuracy and efficiency than their static-bit-width equivalents.
DRQ: Dynamic Region-based Quantization for Deep Neural Network Acceleration
TLDR
A dynamic region-based quantization scheme, DRQ, is proposed that changes the precision of a DNN model dynamically based on sensitive regions in the feature map, achieving greater acceleration while preserving accuracy.
Dreaming to Distill: Data-Free Knowledge Transfer via DeepInversion
TLDR
DeepInversion is introduced, a new method for synthesizing images from the image distribution used to train a deep neural network, which optimizes the input while regularizing the distribution of intermediate feature maps using information stored in the batch normalization layers of the teacher.
Efficient Execution of Quantized Deep Learning Models: A Compiler Approach
TLDR
This paper addresses the challenges of executing quantized deep learning models on diverse hardware platforms by proposing an augmented compiler approach that creates a new dialect, Quantized Neural Network (QNN), which extends the compiler's internal representation with a quantization context.
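Referring back to the HAWQ-V2 entry above: the average Hessian eigenvalue used there as a layer-sensitivity metric is commonly estimated with Hutchinson-style Hessian-vector products rather than by forming the Hessian explicitly. A minimal PyTorch sketch (the function name and sampling budget are assumptions, not HAWQ-V2's implementation):

import torch

def avg_hessian_eigenvalue(loss, params, n_samples=16):
    # Hutchinson estimate of trace(H) / n, i.e. the average Hessian
    # eigenvalue, using only Hessian-vector products.
    # (Sketch; the interface and sampling budget are assumptions.)
    grads = torch.autograd.grad(loss, params, create_graph=True)
    n = sum(p.numel() for p in params)
    trace_est = 0.0
    for _ in range(n_samples):
        # Rademacher probe vectors with entries in {-1, +1}
        vs = [torch.randint_like(p, 2) * 2.0 - 1.0 for p in params]
        # Hessian-vector product H v via a second backward pass
        hv = torch.autograd.grad(grads, params, grad_outputs=vs, retain_graph=True)
        trace_est += sum((h * v).sum().item() for h, v in zip(hv, vs))
    return trace_est / (n_samples * n)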