VecQ: Minimal Loss DNN Model Compression With Vectorized Weight Quantization

Cheng Gong, Yao-Yu Chen, Ye Lu, Tao Li, Cong Hao, Deming Chen
IEEE Transactions on Computers
Quantization has been proven to be an effective method for reducing the computing and/or storage cost of DNNs. However, the trade-off between the quantization bitwidth and final accuracy is complex and non-convex, which makes it difficult to optimize directly. Minimizing the direct quantization loss (DQL) of the coefficient data is an effective local optimization method, but previous works often neglect accurate control of the DQL, resulting in a higher loss of the final DNN model accuracy…
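To make the idea of direct quantization loss concrete, the sketch below measures it as the squared L2 distance between a weight list and its quantized copy, then grid-searches the scale that minimizes it. This is an illustration only and not the VecQ algorithm: the symmetric uniform quantizer, the grid search, and the names `quantize`, `dql`, and `best_scale` are all assumptions introduced here.

```python
import random

def quantize(weights, scale, bits):
    """Round each weight to the nearest level of a symmetric uniform grid."""
    qmax = 2 ** (bits - 1) - 1  # 1 for 2 bits, 3 for 3 bits, 7 for 4 bits
    out = []
    for w in weights:
        q = round(w / scale)
        q = max(-qmax, min(qmax, q))  # clip to the representable range
        out.append(q * scale)
    return out

def dql(weights, scale, bits):
    """Direct quantization loss: squared L2 distance to the quantized weights."""
    quantized = quantize(weights, scale, bits)
    return sum((w - wq) ** 2 for w, wq in zip(weights, quantized))

def best_scale(weights, bits, steps=200):
    """Grid-search the scale with the lowest loss (assumes a nonzero weight)."""
    max_abs = max(abs(w) for w in weights)
    candidates = [max_abs * (i + 1) / steps for i in range(steps)]
    return min(candidates, key=lambda s: dql(weights, s, bits))

random.seed(0)
weights = [random.gauss(0.0, 1.0) for _ in range(1000)]  # synthetic weights
for bits in (2, 3, 4):
    s = best_scale(weights, bits)
    print(f"{bits}-bit: scale={s:.3f}  DQL={dql(weights, s, bits):.3f}")
```

On the same candidate grid, the minimum loss cannot increase as the bitwidth grows, because the levels of the lower-bit grid are a subset of the higher-bit grid at the same scale.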
Filter Pre-Pruning for Improved Fine-tuning of Quantized Deep Neural Networks
A new pruning method called Pruning for Quantization (PfQ) is proposed, which removes the filters that disturb the fine-tuning of the DNN while affecting the inferred result as little as possible, and achieves higher performance at a similar model size than conventional quantization methods, including fine-tuning.
  • 2020
Deep Neural Networks (DNNs) have many parameters and much activation data, both of which are expensive to implement. One method to reduce the size of the DNN is to quantize the pre-trained model by using…
Mixed Precision Quantization for ReRAM-based DNN Inference Accelerators
A mixed precision quantization scheme is proposed for ReRAM-based DNN inference accelerators, in which weight quantization, input quantization, and partial sum quantization are jointly applied for each DNN layer, together with an automated quantization flow powered by deep reinforcement learning that searches for the best quantization configuration in the large design space.
3U-EdgeAI: Ultra-Low Memory Training, Ultra-Low Bitwidth Quantization, and Ultra-Low Latency Acceleration
A novel rank-adaptive tensor-based tensorized neural network model is proposed, which offers orders-of-magnitude memory reduction during training, along with an ultra-low bitwidth quantization method for DNN model compression that achieves state-of-the-art accuracy under the same compression ratio.
FracBNN: Accurate and FPGA-Efficient Binary Neural Networks with Fractional Activations
The proposed FracBNN exploits fractional activations to substantially improve the accuracy of BNNs, and implements the entire optimized network architecture on an embedded FPGA (Xilinx Ultra96 v2) with the ability of real-time image classification.
Blackthorn: Latency Estimation Framework for CNNs on Embedded Nvidia Platforms
This work proposes Blackthorn, a layer-wise latency estimation framework for embedded Nvidia GPUs based on analytical models that provides accurate predictions for each layer, helping developers to find bottlenecks and optimize the architecture of a DNN to fit target platforms.
The increase in data production has enabled a newfound interest in deep-learning-based solutions. From service customization to healthcare applications, deep neural networks have successfully been…
Enabling Design Methodologies and Future Trends for Edge AI: Specialization and Codesign
The authors argue that workloads that were formerly performed in the cloud are increasingly moving to resource-limited edge computing systems, which raises a new set of challenges for machine learning as well as new opportunities.
A high-throughput scalable BNN accelerator with fully pipelined architecture
Elastic Significant Bit Quantization and Acceleration for Deep Neural Networks
A new method called elastic significant bit quantization (ESB) is proposed, which controls the number of significant bits of quantized values to obtain better inference accuracy with fewer resources; it is implemented as an accelerator, and its efficiency is quantitatively evaluated on FPGAs.


µL2Q: An Ultra-Low Loss Quantization Method for DNN Compression
  • Cheng Gong, Tao Li, +4 authors Yao Chen
  • Computer Science
  • 2019 International Joint Conference on Neural Networks (IJCNN)
  • 2019
This work proposes an effective method, called ultra-low loss quantization (µL2Q), to provide DNN quantization schemes based on comprehensive quantitative data analysis; it transforms the original data to a data space with a standard normal distribution and finds the optimal parameters that minimize the quantization loss for a targeted bit width.
Incremental Network Quantization: Towards Lossless CNNs with Low-Precision Weights
Extensive experiments on the ImageNet classification task using almost all known deep CNN architectures, including AlexNet, VGG-16, GoogleNet and ResNets, testify to the efficacy of the proposed INQ, showing that at 5-bit quantization, models have higher accuracy than the 32-bit floating-point references.
Two-Step Quantization for Low-bit Neural Networks
A simple yet effective Two-Step Quantization (TSQ) framework is proposed, which decomposes the network quantization problem into two steps: code learning, and transformation function learning based on the learned codes. A sparse quantization method is proposed for code learning.
Deep Compression: Compressing Deep Neural Network with Pruning, Trained Quantization and Huffman Coding
This work introduces "deep compression", a three-stage pipeline of pruning, trained quantization and Huffman coding, which work together to reduce the storage requirement of neural networks by 35x to 49x without affecting their accuracy.
HAQ: Hardware-Aware Automated Quantization With Mixed Precision
The Hardware-Aware Automated Quantization (HAQ) framework is introduced, which leverages reinforcement learning to automatically determine the quantization policy and takes the hardware accelerator's feedback in the design loop to generate direct feedback signals to the RL agent.
Extremely Low Bit Neural Network: Squeeze the Last Bit Out with ADMM
This paper focuses on compressing and accelerating deep models with network weights represented by a very small number of bits, referred to as extremely low bit neural networks, and proposes to solve this problem using extragradient and iterative quantization algorithms that lead to considerably faster convergence compared to conventional optimization methods.
Quantizing deep convolutional networks for efficient inference: A whitepaper
An overview of techniques for quantizing convolutional neural networks for inference with integer weights and activations is presented, and it is recommended that per-channel quantization of weights and per-layer quantization of activations be the preferred quantization scheme for hardware acceleration and kernel optimization.
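The difference between per-layer and per-channel weight quantization can be sketched with a toy example. The code below is an illustration under assumptions and is not taken from the whitepaper: it uses a symmetric 8-bit quantizer, a made-up 2-channel weight matrix, and the hypothetical helper names `quantize_symmetric` and `squared_error`.

```python
def quantize_symmetric(values, scale, qmax=127):
    """Quantize to signed integers in [-qmax, qmax], then map back to floats."""
    return [max(-qmax, min(qmax, round(v / scale))) * scale for v in values]

def squared_error(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

# Toy weights: one row per output channel, with very different ranges.
weights = [
    [0.9, -1.2, 0.4, 1.1],          # channel with a large range
    [0.010, -0.020, 0.015, 0.005],  # channel with a small range
]

# Per-layer: a single scale derived from the largest magnitude in the layer.
layer_scale = max(abs(v) for row in weights for v in row) / 127
per_layer_error = sum(
    squared_error(row, quantize_symmetric(row, layer_scale)) for row in weights
)

# Per-channel: each row gets a scale derived from its own largest magnitude.
per_channel_error = 0.0
for row in weights:
    channel_scale = max(abs(v) for v in row) / 127
    per_channel_error += squared_error(row, quantize_symmetric(row, channel_scale))

print(f"per-layer error:   {per_layer_error:.3e}")
print(f"per-channel error: {per_channel_error:.3e}")
```

In this toy case the small-range channel is rounded onto a grid sized for the large-range channel when a single scale is shared, which is why its error drops once it gets its own scale.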
Trained Ternary Quantization
This work proposes Trained Ternary Quantization (TTQ), a method that can reduce the precision of weights in neural networks to ternary values and improve the accuracy of some models (32-, 44-, and 56-layer ResNet on CIFAR-10 and AlexNet on ImageNet).
GradiVeQ: Vector Quantization for Bandwidth-Efficient Gradient Aggregation in Distributed CNN Training
This paper empirically demonstrates the strong linear correlations between CNN gradients, and proposes a gradient vector quantization technique, named GradiVeQ, to exploit these correlations through principal component analysis (PCA) for substantial gradient dimension reduction.
Quantization and Training of Neural Networks for Efficient Integer-Arithmetic-Only Inference
A quantization scheme is proposed that allows inference to be carried out using integer-only arithmetic, which can be implemented more efficiently than floating point inference on commonly available integer-only hardware.
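As a rough illustration of how integer arithmetic can stand in for floating point, the sketch below uses a common affine mapping, real = scale * (q - zero_point), and computes a dot product with an integer accumulator. The helper names (`choose_params`, `quantize`, `dequantize`) and the example vectors are assumptions made here, and the final rescale is done in floating point for brevity, whereas an integer-only pipeline would use a fixed-point multiplier.

```python
def choose_params(lo, hi, qmin=0, qmax=255):
    """Pick scale and zero point so that [lo, hi] maps onto [qmin, qmax]."""
    lo, hi = min(lo, 0.0), max(hi, 0.0)  # keep 0.0 exactly representable
    scale = (hi - lo) / (qmax - qmin)
    if scale == 0.0:                     # degenerate all-zero range
        scale = 1.0
    zero_point = round(qmin - lo / scale)
    return scale, zero_point

def quantize(values, scale, zero_point, qmin=0, qmax=255):
    return [max(qmin, min(qmax, round(v / scale) + zero_point)) for v in values]

def dequantize(q_values, scale, zero_point):
    return [scale * (q - zero_point) for q in q_values]

a = [0.5, -1.0, 2.0, 0.25]
b = [1.5, 0.5, -0.75, 1.0]
sa, za = choose_params(min(a), max(a))
sb, zb = choose_params(min(b), max(b))
qa, qb = quantize(a, sa, za), quantize(b, sb, zb)

# The accumulation uses only integer subtraction, multiplication, and addition.
acc = sum((x - za) * (y - zb) for x, y in zip(qa, qb))
approx = sa * sb * acc  # single rescale back to a real value
exact = sum(x * y for x, y in zip(a, b))
print(f"exact={exact:.4f}  quantized={approx:.4f}")
```

The zero point is rounded to an integer so that a real value of 0.0 quantizes and dequantizes without error, which matters for operations such as zero padding.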