Distribution Adaptive INT8 Quantization for Training CNNs

Kang Zhao, Sida Huang, Pan Pan, Yinghan Li, Yingya Zhang, Zhenyu Gu, Yinghui Xu
Research has demonstrated that low bit-width (e.g., INT8) quantization can be employed to accelerate inference. This makes gradient quantization especially promising, since backward propagation requires approximately twice the computation of the forward pass. Due to the variability and uncertainty of gradient distributions, many methods have been proposed to attain training stability. However, most of them ignore channel-wise gradient distributions and the impact of gradients… 
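
The channel-wise idea mentioned in the abstract can be illustrated with a minimal sketch: symmetric INT8 quantization where each channel gets its own scale, so channels with small gradients are not crushed by one global scale. This is a generic illustration, not the paper's actual algorithm; the function names are placeholders.

```python
def quantize_per_channel_int8(grads):
    """Symmetric per-channel INT8 quantization sketch: each channel (row)
    gets its own scale mapping [-max_abs, max_abs] onto [-127, 127]."""
    q, scales = [], []
    for channel in grads:
        max_abs = max(abs(g) for g in channel) or 1.0
        scale = max_abs / 127.0
        scales.append(scale)
        q.append([max(-127, min(127, round(g / scale))) for g in channel])
    return q, scales

def dequantize(q, scales):
    """Recover approximate floats from INT8 values and per-channel scales."""
    return [[v * s for v in row] for row, s in zip(q, scales)]
```

With a single per-tensor scale, the first channel below (gradients around 0.1) would collapse to one or two quantization levels; per-channel scales keep its relative precision.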


Rethinking the Importance of Quantization Bias, Toward Full Low-Bit Training

This is the first work to quantize gradients of all layers to 8 bits in both large-scale CNN and RNN training with negligible accuracy loss, and proposes a novel adaptive piecewise quantization method to effectively limit the bias of gradient quantization noise.

F8Net: Fixed-Point 8-bit Only Multiplication for Network Quantization

This work presents F8Net, a novel quantization framework consisting of only fixed-point 8-bit multiplication, which achieves comparable or better performance not only against existing quantization techniques that rely on INT32 multiplication or floating-point arithmetic, but also against full-precision counterparts, reaching state-of-the-art results.

Is Integer Arithmetic Enough for Deep Learning Training?

The novel training method forms a fully integer training pipeline that does not change the trajectory of the loss and accuracy compared to floating-point, nor does it need any special hyper-parameter tuning, distribution adjustment, or gradient clipping.

You Already Have It: A Generator-Free Low-Precision DNN Training Framework Using Stochastic Rounding

This paper innovatively proposes to exploit the stochastic property of the DNN training process itself and directly extract random numbers from DNNs in a self-sufficient manner; an evaluation of the extracted random numbers finds that high-quality random numbers widely exist in DNNs, and their quality can even pass the NIST test suite.
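
The random numbers this line refers to feed stochastic rounding, the standard unbiased rounding rule in low-precision training. A minimal sketch of that rule (generic, not this paper's extraction mechanism; `rng` here is an ordinary software generator standing in for the DNN-extracted bits):

```python
import math
import random

def stochastic_round(x, rng=random):
    """Round x up with probability equal to its fractional part, down
    otherwise, so the rounding is unbiased: E[stochastic_round(x)] == x."""
    floor_x = math.floor(x)
    frac = x - floor_x
    return floor_x + (1 if rng.random() < frac else 0)
```

Unlike round-to-nearest, small gradient updates are not systematically lost: a value of 2.25 rounds to 3 a quarter of the time, so the expectation is preserved.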

On the Convergence of Stochastic Gradient Descent in Low-precision Number Formats

Both deterministic and stochastic analysis of the SGD algorithm are presented, obtaining bounds that show the effect of number format, which can provide guidelines as to how SGD convergence is affected when constraints render the possibility of performing high-precision computations remote.
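
The style of result described above can be sketched schematically (assumed generic notation, not the paper's exact bound): quantizing both the gradient and the stored iterate adds a floor term that depends on the quantization step of the number format.

```latex
% Low-precision SGD sketch: Q_\delta is a quantizer with step \delta,
% \eta the learning rate, f the loss. The convergence bound picks up a
% format-dependent floor on top of the usual SGD rate.
w_{t+1} = Q_\delta\!\left( w_t - \eta \, Q_\delta\!\left( \nabla f(w_t) \right) \right),
\qquad
\min_{t \le T} \mathbb{E}\!\left[ \| \nabla f(w_t) \|^2 \right]
\;\le\; \mathcal{O}\!\left( \tfrac{1}{\sqrt{T}} \right) + \mathcal{O}(\delta)
```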

Exploiting the Partly Scratch-off Lottery Ticket for Quantization-Aware Training

A heuristic method, dubbed lottery ticket scratcher (LTS), is developed: it freezes a weight once the distance between the full-precision value and its quantization level falls below a controllable threshold, typically eliminating 30%-60% of weight updates and 15%-30% of the FLOPs of the backward pass.
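
The freezing rule described above can be sketched as a simple masking function (an illustrative reading of the rule, not LTS's actual implementation; the function name is a placeholder):

```python
def lts_update_mask(weights, quantized, threshold):
    """LTS-style freezing sketch: a weight is frozen (mask = 0) once the
    gap between its full-precision value and its quantization level is
    below the threshold; frozen weights skip the backward update."""
    return [0 if abs(w - q) < threshold else 1
            for w, q in zip(weights, quantized)]
```

Multiplying gradients by this mask is what eliminates the quoted fraction of weight updates in the backward pass.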

FAT: An In-Memory Accelerator with Fast Addition for Ternary Weight Neural Networks

A Sparse Addition Control Unit is proposed that exploits the sparsity of TWNs to skip null operations on zero weights, along with a fast addition scheme based on the memory Sense Amplifier that avoids the time overhead of both carry propagation and writing the carry back to memory cells.

Towards Accurate Binary Neural Networks via Modeling Contextual Dependencies

This work proposes a binary multi-layer perceptron (MLP) block as an alternative to binary convolution blocks to directly model contextual dependencies, and builds BNNs with explicit Contextual Dependency modeling, termed BCDNet.

TAB: Unified and Optimized Ternary, Binary, and Mixed-precision Neural Network Inference on the Edge

TAB includes a unified value representation, an efficient data storage scheme, and novel bitwise dot-product pipelines on CPU/GPU platforms, and introduces a bitwidth-last data format that stores the first and second bits of the ternary values separately to remove the bit extraction overhead.
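
The two-bitmap idea behind bitwise ternary arithmetic can be sketched as follows. This uses an assumed encoding (one bitmap for +1 positions, one for -1 positions) to show why ternary dot products reduce to AND/popcount; TAB's actual bitwidth-last storage layout differs.

```python
def pack_ternary(values):
    """Pack a ternary vector (+1/0/-1) into two bitmaps: one marking the
    +1 positions and one marking the -1 positions."""
    pos = neg = 0
    for i, v in enumerate(values):
        if v == 1:
            pos |= 1 << i
        elif v == -1:
            neg |= 1 << i
    return pos, neg

def ternary_dot(a, b):
    """Ternary dot product via bitwise ops: count positions where the
    product is +1 (signs agree) minus positions where it is -1 (signs clash)."""
    a_pos, a_neg = a
    b_pos, b_neg = b
    agree = (a_pos & b_pos) | (a_neg & b_neg)
    clash = (a_pos & b_neg) | (a_neg & b_pos)
    return bin(agree).count("1") - bin(clash).count("1")
```

On real hardware the `bin(...).count("1")` calls become single popcount instructions, which is what makes these pipelines fast.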

Bitwidth Heterogeneous Federated Learning with Progressive Weight Dequantization

This work introduces a pragmatic FL scenario with bitwidth heterogeneity across the participating devices, dubbed Bitwidth Heterogeneous Federated Learning (BHFL), and proposes the ProWD framework, which has a trainable weight dequantizer at the central server that progressively reconstructs low-bitwidth weights into higher-bitwidth weights, and finally into full-precision weights.

Towards Unified INT8 Training for Convolutional Neural Network

This paper attempts to build a unified 8-bit (INT8) training framework for common convolutional neural networks, addressing both accuracy and speed, and proposes two universal techniques that reduce the direction deviation of gradients and avoid illegal gradient updates along the wrong direction.

Accurate and Efficient 2-bit Quantized Neural Networks

Novel techniques that individually target weight and activation quantization are proposed, resulting in an overall quantized neural network (QNN) that achieves state-of-the-art classification accuracy (comparable to full-precision networks) across a range of popular models and datasets.

Fixed-Point Back-Propagation Training

By keeping the data distribution stable through a layer-wise precision-adaptive quantization, this paper is able to directly train deep neural networks using low bit-width fixed-point data and achieve guaranteed accuracy, without changing hyper parameters.

LQ-Nets: Learned Quantization for Highly Accurate and Compact Deep Neural Networks

This work proposes to jointly train a quantized, bit-operation-compatible DNN and its associated quantizers, as opposed to using fixed, handcrafted quantization schemes such as uniform or logarithmic quantization, to address the gap in prediction accuracy between the quantized model and the full-precision model.

Data-Free Quantization Through Weight Equalization and Bias Correction

We introduce a data-free quantization method for deep neural networks that does not require fine-tuning or hyperparameter selection. It achieves near-original model performance on common computer

Trained Quantization Thresholds for Accurate and Efficient Fixed-Point Inference of Deep Neural Networks

The proposed method of training quantization thresholds (TQT) for uniform symmetric quantizers using standard backpropagation and gradient descent is able to achieve near-floating-point accuracy on traditionally difficult networks such as MobileNets with less than 5 epochs of quantized (8-bit) retraining.

Deep Learning with Low Precision by Half-Wave Gaussian Quantization

A half-wave Gaussian quantizer (HWGQ) is proposed for forward approximation and shown to have an efficient implementation, by exploiting the statistics of network activations and batch normalization operations, and to achieve much closer performance to full-precision networks than previously available low-precision networks.
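
The half-wave shape can be sketched as follows: negatives map to 0 (as in ReLU) and positives snap to the nearest of a few fixed levels. The level values below are illustrative placeholders, not the optimal Gaussian-derived HWGQ levels from the paper.

```python
def hwgq(x, levels=(0.538, 1.076, 1.614)):
    """Half-wave quantizer sketch: zero out negatives, then snap positive
    inputs to the nearest quantization level (levels are placeholders)."""
    if x <= 0:
        return 0.0
    return min(levels, key=lambda q: abs(x - q))
```

In HWGQ the levels are chosen to minimize quantization error under a Gaussian activation model, which batch normalization makes approximately valid.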

Simultaneously Optimizing Weight and Quantizer of Ternary Neural Network Using Truncated Gaussian Approximation

  • Zhezhi He, Deliang Fan
  • Computer Science
    2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
  • 2019
This work is the first to incorporate the thresholds of weight ternarization into a closed-form representation using truncated Gaussian approximation, enabling simultaneous optimization of weights and quantizer through back-propagation training.

Towards Effective Low-Bitwidth Convolutional Neural Networks

This paper tackles the problem of training a deep convolutional neural network with both low-precision weights and low-bitwidth activations by proposing a two-stage optimization strategy to progressively find good local minima, and by adopting a novel learning scheme to jointly train a full-precision model alongside the low-precision one.

Two-Step Quantization for Low-bit Neural Networks

A simple yet effective Two-Step Quantization (TSQ) framework is proposed that decomposes the network quantization problem into two steps, code learning and transformation function learning based on the learned codes, with a sparse quantization method for code learning.