• Corpus ID: 230523998

I-BERT: Integer-only BERT Quantization

@article{Kim2021IBERTIB,
  title={I-BERT: Integer-only BERT Quantization},
  author={Sehoon Kim and Amir Gholami and Zhewei Yao and Michael W. Mahoney and Kurt Keutzer},
  journal={ArXiv},
  year={2021},
  volume={abs/2101.01321}
}
Transformer-based models, like BERT and RoBERTa, have achieved state-of-the-art results in many Natural Language Processing tasks. However, their memory footprint, inference latency, and power consumption are prohibitive for many edge processors, and deploying these models on resource-constrained edge applications and devices has been a challenge. While quantization can be a viable solution to this, previous work on quantizing Transformer-based models uses floating-point…
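For context on the quantization the abstract alludes to, the snippet below is a minimal, illustrative sketch of symmetric uniform quantization (a float tensor mapped to int8 values plus a scale). It is not I-BERT's implementation; the function names and the per-tensor scale choice are assumptions made only for this example.

import numpy as np

def quantize_symmetric(x, num_bits=8):
    # One scale for the whole tensor; signed integer range is [-qmax, qmax].
    qmax = 2 ** (num_bits - 1) - 1
    scale = max(float(np.max(np.abs(x))) / qmax, 1e-12)  # guard against all-zero input
    q = np.clip(np.round(x / scale), -qmax, qmax).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    # Recover an approximation of the original float tensor.
    return q.astype(np.float32) * scale

w = np.random.randn(4, 4).astype(np.float32)
q, s = quantize_symmetric(w)
print(np.max(np.abs(w - dequantize(q, s))))  # error is bounded by roughly scale / 2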

Citations

Learned Token Pruning for Transformers
TLDR
A novel token reduction method dubbed Learned Token Pruning (LTP) is presented which adaptively removes unimportant tokens as an input sequence passes through transformer layers, and is more robust than prior methods to variations in input sequence lengths.
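As a rough illustration of the threshold-based pruning the TLDR describes (not LTP's actual algorithm; the importance score, averaging scheme, and names below are assumptions), a token's importance can be estimated from the attention it receives and compared against a threshold:

import numpy as np

def prune_tokens(hidden, attn_probs, threshold):
    # hidden: (seq_len, dim); attn_probs: (num_heads, seq_len, seq_len).
    # Importance of each token = attention it receives, averaged over heads and queries.
    importance = attn_probs.mean(axis=(0, 1))
    keep = importance >= threshold
    return hidden[keep], keep

hidden = np.random.randn(6, 16)
attn = np.random.rand(4, 6, 6)
attn /= attn.sum(axis=-1, keepdims=True)          # rows normalized like softmax output
pruned, mask = prune_tokens(hidden, attn, threshold=1.0 / 6)
print(pruned.shape, mask)                          # tokens below average importance are dropped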
A Survey of Quantization Methods for Efficient Neural Network Inference
TLDR
This article surveys approaches to the problem of quantizing the numerical values in deep neural network computations, covering the advantages and disadvantages of current methods.
Confounding Tradeoffs for Neural Network Quantization
TLDR
This work articulates a variety of tradeoffs whose impact is often overlooked and empirically analyzes their impact on uniform and mixed-precision post-training quantization, finding that these confounding tradeoffs may have a larger impact on quantized network accuracy than the actual quantization methods themselves.
RCT: Resource Constrained Training for Edge AI
Training neural networks on edge terminals is essential for edge AI computing, which needs to adapt to evolving environments. Quantised models can run efficiently on edge devices, but existing…
DTQAtten: Leveraging Dynamic Token-based Quantization for Efficient Attention Architecture
  • Tao Yang, Dongyue Li, Li Jiang
  • Computer Science
    2022 Design, Automation & Test in Europe Conference & Exhibition (DATE)
  • 2022
TLDR
This work presents an algorithm-architecture co-design based on dynamic, mixed-precision quantization, together with an optimization strategy that alleviates the pipeline-stall problem in VSSA without hardware overhead, and evaluates it on existing attention-based NLP models.
Mokey: enabling narrow fixed-point inference for out-of-the-box floating-point transformer models
TLDR
The Mokey accelerator delivers an order-of-magnitude improvement in energy efficiency over a Tensor Cores-based accelerator while improving performance by at least 4× and as much as 15×, depending on the model and on-chip buffering capacity.
NN-LUT: Neural Approximation of Non-Linear Operations for Efficient Transformer Inference
TLDR
The proposed framework called NN-LUT can accurately replace all the non-linear operations in popular BERT models with significant reductions in area, power consumption, and latency.
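As a hedged sketch of the kind of table-based non-linear approximation NN-LUT targets (the breakpoint count, input range, and tanh-based GELU formula here are assumptions, not the paper's learned tables):

import numpy as np

def build_gelu_table(lo=-6.0, hi=6.0, n=64):
    # Sample the tanh approximation of GELU at n evenly spaced breakpoints.
    xs = np.linspace(lo, hi, n)
    ys = 0.5 * xs * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (xs + 0.044715 * xs ** 3)))
    return xs, ys

def gelu_lut(x, xs, ys):
    # Piecewise-linear interpolation between the stored breakpoints.
    return np.interp(x, xs, ys)

xs, ys = build_gelu_table()
print(gelu_lut(np.array([-1.0, 0.0, 1.0]), xs, ys))  # close to GELU(-1), GELU(0), GELU(1)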
A Fast Attention Network for Joint Intent Detection and Slot Filling on Edge Devices
TLDR
A clean and parameter-refined attention module is introduced to enhance the information exchange between intent and slot, improving semantic accuracy by more than 2% and reducing the inference latency to less than 100ms.
Extreme Compression for Pre-trained Transformers Made Simple and Efficient
TLDR
A simple yet effective pipeline for extreme compression, named XTC, demonstrates that pre-training knowledge distillation can be skipped to obtain a 5-layer BERT that achieves better performance than previous state-of-the-art methods, e.g., the 6-layer TinyBERT.

References

SHOWING 1-10 OF 114 REFERENCES
Q-BERT: Hessian Based Ultra Low Precision Quantization of BERT
TLDR
This work performs an extensive analysis of fine-tuned BERT models using second-order Hessian information and uses the results to propose a novel method for quantizing BERT models to ultra-low precision, based on a new group-wise quantization scheme and a Hessian-based mixed-precision method.
HAQ: Hardware-Aware Automated Quantization
TLDR
This paper introduces the Hardware-Aware Automated Quantization (HAQ) framework, which leverages reinforcement learning to automatically determine the quantization policy and takes the hardware accelerator's feedback into the design loop to reduce latency and energy consumption.
HAWQV3: Dyadic Neural Network Quantization
TLDR
This work presents HAWQV3, a novel dyadic quantization framework, and shows that mixed-precision INT4/8 quantization can achieve higher speedups than INT8 inference with minimal impact on accuracy.
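To make the "dyadic" idea above concrete, here is a minimal illustration (the helper names and precision are assumptions, not HAWQV3's code): a real-valued requantization scale is approximated as b / 2**c so that rescaling needs only an integer multiply and a bit shift.

def to_dyadic(scale, precision_bits=20):
    # Approximate a real scale as b / 2**c with integers b and c.
    c = precision_bits
    b = round(scale * (1 << c))
    return b, c

def dyadic_rescale(x_int, b, c):
    # Integer-only rescaling: (x * b) >> c is approximately x * scale.
    return (x_int * b) >> c

b, c = to_dyadic(0.0172)
print(b, c, b / (1 << c))          # close to 0.0172, represented with integers only
print(dyadic_rescale(1000, b, c))  # roughly 1000 * 0.0172, truncated to an integer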
Q8BERT: Quantized 8Bit BERT
TLDR
This work shows how to perform quantization-aware training during the fine-tuning phase of BERT in order to compress BERT by 4× with minimal accuracy loss; the resulting quantized model can accelerate inference when optimized for hardware with 8-bit integer support.
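A minimal sketch of the "fake quantization" trick commonly used for such quantization-aware fine-tuning, assuming a straight-through estimator for the gradient (illustrative PyTorch, not Q8BERT's released code):

import torch

class FakeQuant(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, num_bits=8):
        # Quantize and immediately dequantize so training sees quantization error.
        qmax = 2 ** (num_bits - 1) - 1
        scale = x.abs().max().clamp(min=1e-12) / qmax
        return torch.clamp(torch.round(x / scale), -qmax, qmax) * scale

    @staticmethod
    def backward(ctx, grad_output):
        # Straight-through estimator: pass gradients as if quantization were identity.
        return grad_output, None

x = torch.randn(3, requires_grad=True)
y = FakeQuant.apply(x, 8)
y.sum().backward()
print(x.grad)   # all ones: rounding is ignored in the backward pass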
Quantization and Training of Neural Networks for Efficient Integer-Arithmetic-Only Inference
TLDR
A quantization scheme is proposed that allows inference to be carried out using integer-only arithmetic, which can be implemented more efficiently than floating-point inference on commonly available integer-only hardware.
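The core pattern behind such integer-only inference can be sketched as follows: int8 operands are multiplied with int32 accumulation, then rescaled with a fixed-point multiplier instead of a float. This is illustrative code under assumed scale magnitudes, not the paper's exact scheme.

import numpy as np

def int_only_matmul(q_a, s_a, q_w, s_w, s_out, num_bits=8):
    # q_a, q_w: int8 tensors; s_a, s_w, s_out: their real-valued scales.
    acc = q_a.astype(np.int32) @ q_w.astype(np.int32)   # int32 accumulation
    m = (s_a * s_w) / s_out                              # real requantization factor
    # Express m as m0 * 2**(-shift) with m0 a 32-bit integer (fixed point).
    shift = 31 - int(np.floor(np.log2(m))) - 1
    m0 = int(round(m * (1 << shift)))
    out = (acc.astype(np.int64) * m0) >> shift           # integer-only rescale
    qmax = 2 ** (num_bits - 1) - 1
    return np.clip(out, -qmax - 1, qmax).astype(np.int8)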
Quantizing deep convolutional networks for efficient inference: A whitepaper
TLDR
An overview of techniques for quantizing convolutional neural networks for inference with integer weights and activations is presented, and it is recommended that per-channel quantization of weights and per-layer quantization of activations be the preferred quantization scheme for hardware acceleration and kernel optimization.
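A short illustration of the per-channel weight quantization that recommendation refers to, assuming the first axis is the output-channel axis (a sketch, not the whitepaper's reference code):

import numpy as np

def quantize_per_channel(w, num_bits=8):
    # w: (out_channels, ...); one scale per output channel instead of per tensor.
    qmax = 2 ** (num_bits - 1) - 1
    flat = w.reshape(w.shape[0], -1)
    scales = np.maximum(np.abs(flat).max(axis=1), 1e-12) / qmax
    q = np.clip(np.round(flat / scales[:, None]), -qmax, qmax).astype(np.int8)
    return q.reshape(w.shape), scales

w = np.random.randn(8, 3, 3, 3).astype(np.float32)   # e.g. a small conv kernel
q, scales = quantize_per_channel(w)
print(q.shape, scales.shape)                          # (8, 3, 3, 3) and (8,)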
Efficient 8-Bit Quantization of Transformer Neural Machine Language Translation Model
TLDR
This work quantizes a trained Transformer machine language translation model leveraging INT8/VNNI instructions in the latest Intel Cascade Lake processors to improve inference performance while maintaining less than 0.5% drop in accuracy.
SqueezeNext: Hardware-Aware Neural Network Design
  • A. Gholami, K. Kwon, K. Keutzer
  • Computer Science
    2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW)
  • 2018
TLDR
SqueezeNext is introduced, a new family of neural network architectures whose design was guided by considering previous architectures such as SqueezeNet, as well as by simulation results on a neural network accelerator.
Mixed Precision Quantization of ConvNets via Differentiable Neural Architecture Search
TLDR
A novel differentiable neural architecture search (DNAS) framework is proposed to efficiently explore the exponential search space with gradient-based optimization, surpassing the state-of-the-art compression of ResNet on CIFAR-10 and ImageNet.
LQ-Nets: Learned Quantization for Highly Accurate and Compact Deep Neural Networks
TLDR
This work proposes to jointly train a quantized, bit-operation-compatible DNN and its associated quantizers, as opposed to using fixed, handcrafted quantization schemes such as uniform or logarithmic quantization, to address the gap in prediction accuracy between the quantized model and the full-precision model.