Overcoming Oscillations in Quantization-Aware Training

Markus Nagel, Marios Fournarakis, Yelysei Bondarenko, Tijmen Blankevoort
When training neural networks with simulated quantization, we observe that quantized weights can, rather unexpectedly, oscillate between two adjacent grid points. The importance of this effect and its impact on quantization-aware training (QAT) are not well understood or investigated in the literature. In this paper, we delve deeper into the phenomenon of weight oscillations and show that it can lead to a significant accuracy degradation due to wrongly estimated batch-normalization statistics during inference.
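The oscillation mechanism can be sketched in a toy example: under simulated (fake) quantization with the straight-through estimator, a latent weight whose optimum lies between two grid points keeps crossing the rounding boundary, so its quantized value flips back and forth. A minimal NumPy sketch (the loss, learning rate, and grid are invented for illustration, not taken from the paper):

```python
import numpy as np

def fake_quantize(w, scale):
    """Round-to-nearest uniform quantizer, as used in simulated QAT."""
    return np.round(w / scale) * scale

# Toy 1-D example: the task loss pulls the latent weight toward a target
# that lies between two grid points, so its quantized value keeps flipping.
scale, lr, target = 1.0, 0.4, 0.6     # grid points at ..., 0.0, 1.0, ...
w, history = 0.3, []                   # w is the latent ("shadow") weight
for _ in range(10):
    q = fake_quantize(w, scale)
    history.append(q)
    # Straight-through estimator: the gradient of 0.5*(q - target)^2
    # w.r.t. q is applied directly to the latent weight w.
    w -= lr * (q - target)

flips = sum(a != b for a, b in zip(history, history[1:]))
print(history, flips)   # the quantized weight never settles
```

Any batch-normalization statistics accumulated while the quantized weight bounces between 0.0 and 1.0 describe neither network, which is the failure mode the paper analyzes.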

Oscillation-free Quantization for Low-bit Vision Transformers

This study investigates the connection between the learnable scaling factor and quantized-weight oscillation, uses ViT as a case study to illustrate the findings and remedies, and proposes three techniques accordingly to improve quantization robustness.

Genie: Show Me the Data for Quantization

A post-training quantization scheme for zero-shot quantization is introduced that produces high-quality quantized networks within a few hours, or even half an hour, and a framework called Genie that generates data suited for post-training quantization is proposed.

Neural Networks with Quantization Constraints

This work formulates low-precision supervised learning as a constrained optimization problem and shows that, despite its non-convexity, the resulting problem is strongly dual and does away with gradient estimations; the dual variables indicate the sensitivity of the objective with respect to constraint perturbations.

Exploiting the Partly Scratch-off Lottery Ticket for Quantization-Aware Training

To effectively find the ticket, a heuristic method is developed, dubbed lottery ticket scratcher (LTS), which freezes a weight once the distance between its full-precision value and its quantization level is smaller than a controllable threshold.
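The freezing criterion described above is easy to sketch: a weight is frozen (excluded from further updates) once it already sits close enough to its quantization grid point. A hedged NumPy illustration, with a made-up scale and threshold:

```python
import numpy as np

def quantize(w, scale):
    return np.round(w / scale) * scale

def lts_freeze_mask(w, scale, threshold, frozen):
    """Freeze a weight once |w - q(w)| < threshold (and keep it frozen)."""
    close = np.abs(w - quantize(w, scale)) < threshold
    return frozen | close

scale, threshold = 0.1, 0.02
w = np.array([0.31, 0.04, 0.149, -0.29])
frozen = np.zeros_like(w, dtype=bool)
frozen = lts_freeze_mask(w, scale, threshold, frozen)
print(frozen)   # weights already near a grid point get frozen

# During training, frozen weights simply receive no gradient update:
grad = np.array([0.5, 0.5, 0.5, 0.5])
w = w - 0.1 * grad * (~frozen)
```

Here `lts_freeze_mask` is a hypothetical helper name; the paper's method additionally controls how the threshold evolves over training.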

A Practical Mixed Precision Algorithm for Post-Training Quantization

A simple post-training mixed-precision algorithm is presented that requires only a small unlabeled calibration dataset to automatically select suitable bit-widths for each layer for desirable on-device performance. It is robust to data variation and takes practical hardware deployment constraints into account, making it a strong candidate for practical use.
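The general flavor of such bit-width selection can be sketched with a greedy search: measure each layer's quantization error on calibration data, then repeatedly lower the bit-width of whichever layer hurts least until a budget is met. This is an illustrative sketch of the idea, not the paper's exact algorithm; the layers, budget, and error metric are invented:

```python
import numpy as np

def quant_error(w, bits):
    """MSE after symmetric uniform quantization at the given bit-width."""
    scale = np.max(np.abs(w)) / (2 ** (bits - 1) - 1)
    q = np.clip(np.round(w / scale), -(2 ** (bits - 1)), 2 ** (bits - 1) - 1) * scale
    return np.mean((w - q) ** 2)

# Toy "layers" with different spreads; a small calibration pass would
# supply the real tensors in practice.
rng = np.random.default_rng(0)
layers = [rng.normal(0, s, 1000) for s in (0.1, 1.0, 5.0)]

# Greedy sketch: start at 8 bits everywhere, then lower the bit-width of
# whichever layer increases the total error least until the budget holds.
bits = [8, 8, 8]
budget = 18          # total bits across layers (i.e. 6 bits on average)
while sum(bits) > budget:
    costs = [quant_error(w, b - 2) - quant_error(w, b) if b > 4 else np.inf
             for w, b in zip(layers, bits)]
    i = int(np.argmin(costs))
    bits[i] -= 2
print(bits)
```

A hardware-aware variant would restrict the candidate bit-widths to those the target accelerator actually supports.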

Differentiable Model Compression via Pseudo Quantization Noise

DiffQ is a differentiable method for model compression that quantizes model parameters without gradient approximations (e.g., the straight-through estimator) and is differentiable both with respect to the unquantized weights and the number of bits used.
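The core trick is to replace rounding with additive noise at training time: uniform noise whose width matches the quantization step is a smooth proxy for quantization error, and since the step depends smoothly on the bit-width, the bit-width itself can be learned. A hedged sketch under these assumptions (function name and details are illustrative):

```python
import numpy as np

def pseudo_quant_noise(w, bits, rng):
    """Training-time proxy for quantization: additive uniform noise whose
    width matches the quantization step, instead of actual rounding."""
    delta = (w.max() - w.min()) / (2 ** bits - 1)   # quantization step
    return w + rng.uniform(-delta / 2, delta / 2, size=w.shape)

rng = np.random.default_rng(0)
w = rng.normal(0, 1, 10000)
w4 = pseudo_quant_noise(w, 4.0, rng)   # `bits` may be fractional and learnable
print(np.mean((w4 - w) ** 2))
```

At inference time the noise is replaced by real rounding at the learned bit-width.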

FP8 Quantization: The Power of the Exponent

The chief conclusion is that when doing post-training quantization for a wide range of networks, the FP8 format is better than INT8 in terms of accuracy, and the choice of the number of exponent bits is driven by the severity of outliers in the network.
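The exponent/mantissa trade-off behind that conclusion can be made concrete: more exponent bits buy dynamic range at the cost of mantissa precision, which is what lets FP8 absorb outliers that would force INT8 to a coarse step size. A simplified IEEE-like model (ignoring the special encodings some real FP8 specs use to reclaim extra range):

```python
def fp_max(e_bits, m_bits):
    """Largest normal value of a simplified IEEE-like float with e_bits of
    exponent and m_bits of mantissa (top exponent field reserved)."""
    bias = 2 ** (e_bits - 1) - 1
    emax = (2 ** e_bits - 2) - bias
    return (2 - 2 ** -m_bits) * 2 ** emax

# More exponent bits trade mantissa precision for dynamic range.
for e, m in [(4, 3), (5, 2)]:
    print(f"E{e}M{m}: max normal ~ {fp_max(e, m)}")
print("INT8: max = 127, with a single fixed step size")
```

Under this model an E4M3-style format reaches ~240 and E5M2 ~57344, versus 127 for INT8, illustrating why exponent-bit choice tracks the severity of outliers.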

OLLA: Optimizing the Lifetime and Location of Arrays to Reduce the Memory Usage of Neural Networks

This work presents OLLA, an algorithm that optimizes the lifetime and memory location of the tensors used to train neural networks, and enables the approach to scale to the size of state-of-the-art neural networks using an off-the-shelf ILP solver.

MixBin: Towards Budgeted Binarization

This paper proposes a paradigm to perform partial binarization of neural networks in a controlled sense, thereby constructing budgeted binary neural network (B2NN), and presents MixBin, an iterative search-based strategy that constructs B2NN through optimized mixing of the binary and full-precision components.

Logarithmic Unbiased Quantization: Practical 4-bit Training in Deep Learning

This work suggests a logarithmic unbiased quantization (LUQ) method to quantize both the forward and backward phases to 4-bit, achieving state-of-the-art results in 4-bit training without overhead.
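The key ingredient is unbiased rounding on a logarithmic grid: a magnitude between two powers of two is rounded up or down stochastically, with probabilities chosen so the expected result equals the input. A minimal sketch of that rounding rule (zero handling omitted; the function name is illustrative):

```python
import numpy as np

def stochastic_log_round(x, rng):
    """Round |x| to a power of two, stochastically, so that the expected
    value of the result equals x (unbiased logarithmic rounding)."""
    sign = np.sign(x)
    mag = np.abs(x)                       # assumed nonzero in this sketch
    lo = np.floor(np.log2(mag))
    low, high = 2.0 ** lo, 2.0 ** (lo + 1)
    p = (mag - low) / (high - low)        # probability of rounding up
    up = rng.random(x.shape) < p
    return sign * np.where(up, high, low)

rng = np.random.default_rng(0)
x = np.full(200000, 3.0)                  # halfway between grid points 2 and 4
q = stochastic_log_round(x, rng)
print(q[:5], q.mean())                    # mean is close to 3.0
```

Unbiasedness matters most in the backward pass, where systematic rounding bias would accumulate in the gradients.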

Improving Low-Precision Network Quantization via Bin Regularization

This work proposes a novel weight regularization algorithm for improving low-precision network quantization that separately optimizes all elements in each quantization bin to be as close to the target quantized value as possible.
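The shape of such a regularizer is straightforward: an L2 penalty pulling each weight toward the center of its quantization bin, vanishing once the weights sit on the grid. A hedged sketch (the exact per-bin weighting in the paper may differ):

```python
import numpy as np

def bin_regularizer(w, scale):
    """L2 penalty pulling every weight toward the center of its
    quantization bin (the nearest grid point)."""
    target = np.round(w / scale) * scale
    return np.mean((w - target) ** 2)

w = np.array([0.12, 0.28, -0.41])
print(bin_regularizer(w, 0.1))                         # positive off-grid
print(bin_regularizer(np.round(w / 0.1) * 0.1, 0.1))   # ~0 on the grid
```

Added to the task loss, the penalty shrinks as training drives the weights toward quantization-friendly values.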

A White Paper on Neural Network Quantization

This paper introduces state-of-the-art algorithms for mitigating the impact of quantization noise on the network’s performance while maintaining low-bit weights and activations, and considers two main classes of algorithms: Post-Training Quantization and Quantization-Aware Training.

Loss Aware Post-Training Quantization

This work studies the effect of quantization on the structure of the loss landscape, and designs a method that quantizes the layer parameters jointly, enabling significant accuracy improvement over current post-training quantization methods.

PACT: Parameterized Clipping Activation for Quantized Neural Networks

It is shown, for the first time, that both weights and activations can be quantized to 4-bits of precision while still achieving accuracy comparable to full precision networks across a range of popular models and datasets.
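PACT's central operation is clipping activations to a learnable range [0, α] before uniform quantization, so the range itself adapts during training. A minimal NumPy sketch of the forward pass (the α gradient noted in the comment is handled by the straight-through estimator in training):

```python
import numpy as np

def pact(x, alpha, bits=4):
    """PACT forward pass: clip activations to [0, alpha] with a learnable
    alpha, then quantize uniformly over that range."""
    y = np.clip(x, 0.0, alpha)
    scale = alpha / (2 ** bits - 1)
    return np.round(y / scale) * scale

x = np.array([-1.0, 0.3, 2.0, 7.5])
q = pact(x, alpha=6.0)
print(q)
# The gradient w.r.t. alpha is 1 where x >= alpha and 0 elsewhere,
# which is what lets the clipping level be learned by backprop.
```

Smaller α gives finer resolution inside the range at the cost of clipping more outliers, and the learned value balances the two.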

Quantization Aware Training with Absolute-Cosine Regularization for Automatic Speech Recognition

A novel QAT scheme based on absolute-cosine regularization (ACosR), which enforces a prior, quantization-friendly distribution on the model weights, is introduced and applied to an ASR task assuming a recurrent neural network transducer (RNN-T) architecture.

Training with Quantization Noise for Extreme Model Compression

This paper proposes to quantize only a different random subset of weights during each forward pass, allowing unbiased gradients to flow through the other weights, establishing new state-of-the-art compromises between accuracy and model size in both natural language processing and image classification.
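The per-forward-pass subset quantization reads naturally as a masked mix of quantized and full-precision weights. A hedged sketch of that idea (the sampling and quantizer here are simplified stand-ins):

```python
import numpy as np

def quant_noise(w, scale, p, rng):
    """Quantize a random subset (fraction p) of weights this forward pass;
    the rest stay full-precision, so their gradients are unbiased."""
    q = np.round(w / scale) * scale
    mask = rng.random(w.shape) < p       # resampled every forward pass
    return np.where(mask, q, w)

rng = np.random.default_rng(0)
w = rng.normal(0, 1, 8)
out = quant_noise(w, 0.5, p=0.5, rng=rng)
print(out)
```

Because the mask is redrawn each pass, every weight still gets clean gradient signal on most updates while the network learns to tolerate quantization.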

Post training 4-bit quantization of convolutional networks for rapid-deployment

This paper introduces the first practical 4-bit post-training quantization approach: it neither involves training the quantized model (fine-tuning) nor requires the availability of the full dataset, and it achieves accuracy just a few percent below the state-of-the-art baseline across a wide range of convolutional models.

Quantizing deep convolutional networks for efficient inference: A whitepaper

An overview of techniques for quantizing convolutional neural networks for inference with integer weights and activations is presented, and it is recommended that per-channel quantization of weights and per-layer quantization of activations be the preferred quantization scheme for hardware acceleration and kernel optimization.
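The per-channel recommendation is easy to demonstrate: when output channels have very different magnitudes, a single per-tensor scale is dominated by the largest channel and crushes the small ones, while one scale per channel keeps each range tight. A small NumPy comparison (the channel distributions are invented for illustration):

```python
import numpy as np

def quantize_weights(w, bits=8, per_channel=True):
    """Symmetric quantization with one scale per output channel (axis 0),
    versus a single scale for the whole tensor."""
    axis = tuple(range(1, w.ndim)) if per_channel else None
    absmax = np.max(np.abs(w), axis=axis, keepdims=True)
    scale = absmax / (2 ** (bits - 1) - 1)
    return np.round(w / scale) * scale

rng = np.random.default_rng(0)
# Two output channels with very different ranges.
w = np.stack([rng.normal(0, 0.01, 64), rng.normal(0, 1.0, 64)])
for pc in (True, False):
    err = np.mean((w - quantize_weights(w, 8, pc)) ** 2)
    print(f"per_channel={pc}: mse={err:.2e}")
```

Per-channel scales cost almost nothing at inference (they fold into the per-channel output multiply), which is why the whitepaper recommends them for weights.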

Quantization Networks

This paper provides a simple and uniform way to quantize weights and activations by formulating quantization as a differentiable non-linear function, shedding new light on the interpretation of neural network quantization.