FQ-ViT: Post-Training Quantization for Fully Quantized Vision Transformer

Yang Lin, Tianyu Zhang, Peiqin Sun, Zheng Li, Shuchang Zhou
Network quantization significantly reduces model inference complexity and has been widely used in real-world deployments. However, most existing quantization methods have been developed mainly on Convolutional Neural Networks (CNNs), and suffer severe degradation when applied to fully quantized vision transformers. In this work, we demonstrate that many of these difficulties arise because of serious inter-channel variation in LayerNorm inputs, and present Power-of-Two Factor (PTF), a…
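The core of the PTF idea described above is to keep one shared layer-wise quantization scale but give each channel a power-of-two rescaling factor, so channels with small ranges are not crushed by the channel with the largest range, and the rescaling still reduces to bit-shifts at inference time. A minimal NumPy sketch of this style of quantizer follows; the factor search range, rounding, and clipping details are illustrative assumptions, not the paper's exact algorithm:

```python
import numpy as np

def ptf_quantize(x, bits=8):
    """Sketch of a Power-of-Two Factor (PTF) style quantizer.

    x: float array of shape (tokens, channels), e.g. a LayerNorm input.
    Returns (q, scale, alpha, x_hat): int codes, the shared layer-wise
    scale, per-channel power-of-two exponents, and the dequantized
    reconstruction for accuracy checking.
    """
    qmax = 2 ** (bits - 1) - 1
    per_channel_max = np.abs(x).max(axis=0)          # shape: (channels,)
    # One shared (layer-wise) scale from the largest channel range.
    scale = per_channel_max.max() / qmax
    # Power-of-two factor per channel: channels with a smaller range get
    # extra effective resolution via a left shift by alpha bits.
    alpha = np.floor(np.log2(per_channel_max.max()
                             / np.maximum(per_channel_max, 1e-8)))
    alpha = np.clip(alpha, 0, 3).astype(np.int64)    # assumed range {0..3}
    # Channel c is quantized with effective step scale / 2**alpha[c].
    q = np.clip(np.round(x * (2.0 ** alpha) / scale), -qmax - 1, qmax)
    x_hat = q * scale / (2.0 ** alpha)               # dequantize
    return q.astype(np.int8), scale, alpha, x_hat
```

Because each alpha is an integer exponent, the per-channel rescaling can be folded into integer shifts rather than per-channel floating-point multipliers, which is what makes the scheme hardware-friendly.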

Post-Training Quantization for Vision Transformer
This paper presents an effective post-training quantization algorithm for reducing the memory storage and computational costs of vision transformers, and thoroughly analyzes the relationship between the quantization loss of different layers and feature diversity.
Fully Quantized Network for Object Detection
This paper applies novel techniques to produce fully quantized 4-bit detectors based on RetinaNet and Faster R-CNN, and shows that these achieve state-of-the-art performance for quantized detectors.
Towards Accurate Post-training Network Quantization via Bit-Split and Stitching
This paper proposes a Bit-Split and Stitching framework (Bit-split) for lower-bit post-training quantization with minimal accuracy degradation, which can achieve near-original model performance even when quantizing FP32 models to INT3 without fine-tuning.
Evo-ViT: Slow-Fast Token Evolution for Dynamic Vision Transformer
Evo-ViT is presented, a self-motivated slow-fast token evolution approach for vision transformers that can accelerate vanilla transformers of both flat and deep-narrow structures from the very beginning of the training process.
Training data-efficient image transformers & distillation through attention
This work produces a competitive convolution-free transformer by training on Imagenet only, and introduces a teacher-student strategy specific to transformers that relies on a distillation token ensuring that the student learns from the teacher through attention.
Deep Compression: Compressing Deep Neural Network with Pruning, Trained Quantization and Huffman Coding
This work introduces "deep compression", a three-stage pipeline of pruning, trained quantization, and Huffman coding that works to reduce the storage requirement of neural networks by 35x to 49x without affecting their accuracy.
An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale
Vision Transformer (ViT) attains excellent results compared to state-of-the-art convolutional networks while requiring substantially fewer computational resources to train.
LeViT: a Vision Transformer in ConvNet’s Clothing for Faster Inference
This work designs a family of image classification architectures that optimize the trade-off between accuracy and efficiency in a high-speed regime and proposes LeViT, a hybrid neural network for fast inference image classification that significantly outperforms existing convnets and vision transformers.
I-BERT: Integer-only BERT Quantization
This work proposes a novel integer-only quantization scheme for Transformer based models that quantizes the entire inference process, and demonstrates how to approximate nonlinear operations in Transformer architectures, e.g., GELU, Softmax, and Layer Normalization, with lightweight integer computations.
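A building block behind integer-only kernels like the ones I-BERT uses for LayerNorm is replacing floating-point operations such as square root with pure integer iterations. The sketch below shows a standard Newton-style integer square root of the kind such kernels rely on; the exact iteration and initialization that I-BERT uses may differ, so treat this as an illustration of the idea rather than the paper's algorithm:

```python
def int_sqrt(n: int) -> int:
    """Integer-only floor(sqrt(n)) via Newton's iteration.

    Uses only integer addition and division, so it can run inside an
    integer-only LayerNorm (variance -> standard deviation) without
    touching floating point.
    """
    if n < 2:
        return n
    x = n
    y = (x + 1) // 2
    while y < x:            # converges monotonically to floor(sqrt(n))
        x = y
        y = (x + n // x) // 2
    return x
```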
A Deep Look into Logarithmic Quantization of Model Parameters in Neural Networks
This paper proposes a new logarithmic quantization algorithm that mitigates accuracy deterioration in neural networks containing layers of small size, and achieves the minimum accuracy loss on GoogLeNet after direct quantization compared to quantized counterparts.
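Logarithmic quantization in this sense maps each weight to the nearest signed power of two, so multiplications become bit-shifts. A minimal sketch, with an assumed clipping range and underflow threshold (the paper's algorithm refines this basic scheme):

```python
import numpy as np

def log2_quantize(w, bits=4):
    """Map weights to signed powers of two (or zero).

    Rounding happens in the log domain, so every nonzero output is
    exactly +/- 2**k for an integer exponent k in the representable
    range. The range and underflow cutoff here are assumptions.
    """
    sign = np.sign(w)
    mag = np.abs(w)
    exp = np.round(np.log2(np.maximum(mag, 1e-12)))   # nearest exponent
    exp = np.clip(exp, -(2 ** bits - 2), 0)           # representable exponents
    q = sign * (2.0 ** exp)
    q[mag < 2.0 ** (-(2 ** bits - 2) - 1)] = 0.0      # underflow to zero
    return q
```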