Low-bit Quantization of Recurrent Neural Network Language Models Using Alternating Direction Methods of Multipliers

@article{Xu2020LowbitQO,
  title={Low-bit Quantization of Recurrent Neural Network Language Models Using Alternating Direction Methods of Multipliers},
  author={Junhao Xu and Xie Chen and Shoukang Hu and Jianwei Yu and Xunying Liu and Helen M. Meng},
  journal={ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
  year={2020},
  pages={7939-7943}
}
  • Junhao Xu, Xie Chen, Shoukang Hu, Jianwei Yu, Xunying Liu, Helen M. Meng
  • Published 1 May 2020
  • Computer Science
  • ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
The high memory consumption and computational costs of recurrent neural network language models (RNNLMs) limit their wider application on resource-constrained devices. In recent years, neural network quantization techniques that are capable of producing extremely low-bit compression, for example, binarized RNNLMs, have gained increasing research interest. Directly training quantized neural networks is difficult. By formulating quantized RNNLM training as an optimization problem, this paper… 
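The sketch below is a minimal illustration of the general ADMM recipe the abstract alludes to: keep a full-precision weight copy, a quantized copy, and a scaled dual variable, then alternate between gradient steps on an augmented loss, projection onto a low-bit grid, and a dual update. The toy least-squares objective, the project_low_bit codebook, and all hyperparameters are illustrative assumptions, not the paper's actual RNNLM training setup.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "training" objective: least squares, loss(W) = ||X W - y||^2 / N
X = rng.standard_normal((200, 16))
y = X @ rng.standard_normal(16)

def grad(W):
    return 2.0 * X.T @ (X @ W - y) / len(y)

def project_low_bit(V, n_bits=2):
    """Project onto a symmetric low-bit grid {0, +/-d, ...}; illustrative codebook."""
    levels = 2 ** (n_bits - 1) - 1                  # e.g. 1 for 2 bits -> {-d, 0, +d}
    d = (np.abs(V).max() / max(levels, 1)) or 1.0
    return np.clip(np.round(V / d), -levels, levels) * d

rho, lr = 1.0, 0.05
W = rng.standard_normal(16)      # full-precision weights
Q = project_low_bit(W)           # quantized copy
U = np.zeros_like(W)             # scaled dual variable

for _ in range(200):
    # W-step: a few gradient steps on loss(W) + (rho/2) * ||W - Q + U||^2
    for _ in range(5):
        W -= lr * (grad(W) + rho * (W - Q + U))
    # Q-step: project W + U onto the quantized set
    Q = project_low_bit(W + U)
    # dual update
    U += W - Q

print("quantized levels:", np.unique(Q))
print("training loss at Q:", np.mean((X @ Q - y) ** 2))
```

In an RNNLM setting, the W-step would correspond to ordinary SGD on the language-model loss plus the augmented-Lagrangian penalty, with the projection and dual steps structured the same way.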

Citations

Mixed Precision Low-Bit Quantization of Neural Network Language Models for Speech Recognition

Novel mixed precision neural network LM quantization techniques achieved "lossless" quantization on both tasks, producing model size compression ratios of up to approximately 16 times over the full-precision LSTM and Transformer baseline LMs while incurring no statistically significant word error rate increase.

Mixed Precision Quantization of Transformer Language Models for Speech Recognition

Novel mixed precision DNN quantization methods based on Hessian-trace-weighted quantization perturbation and alternating direction methods of multipliers (ADMM) are used to efficiently train mixed precision quantized DNN systems.
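As a side note on the Hessian-trace sensitivity signal mentioned above, the snippet below sketches Hutchinson's estimator, which approximates trace(H) from Hessian-vector products using random Rademacher probes; the explicit toy Hessian is only there to check the estimate and is not the authors' implementation.

```python
import numpy as np

def hutchinson_trace(hvp, dim, n_samples=64, seed=0):
    """Estimate trace(H) using Rademacher probes: E[v^T H v] = trace(H)."""
    rng = np.random.default_rng(seed)
    total = 0.0
    for _ in range(n_samples):
        v = rng.choice([-1.0, 1.0], size=dim)
        total += v @ hvp(v)
    return total / n_samples

# Toy check against an explicit symmetric matrix (illustrative only).
rng = np.random.default_rng(1)
H = rng.standard_normal((32, 32))
H = H @ H.T
print(hutchinson_trace(lambda v: H @ v, 32), np.trace(H))
```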

4-bit Quantization of LSTM-based Speech Recognition Models

This work customizes quantization schemes depending on the local properties of the network, improving recognition performance while limiting computational time, and shows that minimal accuracy loss is achievable with an appropriate choice of quantizers and initializations.

Towards Green ASR: Lossless 4-bit Quantization of a Hybrid TDNN System on the 300-hr Switchboard Corpus

Experiments suggest that the proposed neural architectural compression and mixed precision quantization techniques consistently outperform the uniform-precision quantized baseline systems of comparable bit-widths in terms of word error rate (WER).

Mixed Precision DNN Quantization for Overlapped Speech Separation and Recognition

Novel mixed precision DNN quantization methods are proposed that apply locally variable bit-widths to individual TCN components of a TF-masking-based multi-channel speech separation system, automatically learning the optimal local precision settings using three techniques.

References

Showing 1-10 of 39 references

Alternating Multi-bit Quantization for Recurrent Neural Networks

This work quantizes the network, both weights and activations, into multiple binary codes {-1,+1} and formulates the quantization as an optimization problem; the approach achieves excellent performance on both RNNs and feedforward neural networks and is extended to image classification tasks.
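A small sketch of the alternating scheme summarized above, assuming the commonly stated formulation w ≈ Σ_i α_i b_i with binary codes b_i ∈ {-1,+1}: the scales are re-fit by least squares given the codes, and the codes are re-picked per element by enumerating all 2^k code combinations. Function names and iteration counts are illustrative.

```python
import numpy as np
from itertools import product

def multi_bit_quantize(w, k=2, n_iters=10):
    """Approximate w as B @ alphas, with B in {-1,+1}^(n x k) and k scales."""
    n = w.size
    # Greedy initialisation: peel off sign/mean of the residual k times.
    B = np.empty((n, k))
    alphas = np.empty(k)
    r = w.copy()
    for i in range(k):
        B[:, i] = np.sign(r) + (r == 0)           # avoid zero signs
        alphas[i] = np.abs(r).mean()
        r -= alphas[i] * B[:, i]
    # Enumerate all 2^k binary code combinations once.
    codes = np.array(list(product([-1.0, 1.0], repeat=k)))   # shape (2^k, k)
    for _ in range(n_iters):
        # alpha-step: least-squares fit of the scales given the binary codes
        alphas, *_ = np.linalg.lstsq(B, w, rcond=None)
        # B-step: per element, pick the code combination whose value is closest to w
        recon = codes @ alphas                    # (2^k,) candidate values
        idx = np.abs(w[:, None] - recon[None, :]).argmin(axis=1)
        B = codes[idx]
    # Final refit so the returned scales match the returned codes.
    alphas, *_ = np.linalg.lstsq(B, w, rcond=None)
    return alphas, B

rng = np.random.default_rng(0)
w = rng.standard_normal(1024)
alphas, B = multi_bit_quantize(w, k=2)
print("scales:", alphas, "relative error:", np.linalg.norm(w - B @ alphas) / np.linalg.norm(w))
```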

Quantized Neural Networks: Training Neural Networks with Low Precision Weights and Activations

A binary matrix multiplication GPU kernel is programmed with which it is possible to run the MNIST QNN 7 times faster than with an unoptimized GPU kernel, without suffering any loss in classification accuracy.

Limited-Memory BFGS Optimization of Recurrent Neural Network Language Models for Speech Recognition

  • Xunying Liu, Shansong Liu, H. Meng
  • Computer Science
  • 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
  • 2018
A limited-memory Broyden-Fletcher-Goldfarb-Shanno (L-BFGS) based second-order optimization technique is proposed for RNNLMs that efficiently approximates the matrix-vector product between the inverse Hessian and the gradient vector via a recursion over past gradients with a compact memory requirement.
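For reference, the snippet below sketches the standard L-BFGS two-loop recursion, which approximates the inverse-Hessian-times-gradient product from a short history of parameter and gradient differences; this is the textbook recursion rather than the authors' RNNLM-specific training recipe, and the quadratic check at the end is purely illustrative.

```python
import numpy as np

def lbfgs_direction(grad, s_hist, y_hist):
    """Approximate (inverse Hessian) @ grad via the two-loop recursion.

    s_hist[i] = x_{i+1} - x_i and y_hist[i] = g_{i+1} - g_i, oldest first.
    """
    q = grad.copy()
    rhos = [1.0 / (y @ s) for s, y in zip(s_hist, y_hist)]
    alphas = []
    # First loop: newest pair to oldest.
    for s, y, rho in reversed(list(zip(s_hist, y_hist, rhos))):
        a = rho * (s @ q)
        q -= a * y
        alphas.append(a)                          # stored newest-first
    # Initial Hessian scaling gamma = s^T y / y^T y from the newest pair.
    s_new, y_new = s_hist[-1], y_hist[-1]
    r = (s_new @ y_new) / (y_new @ y_new) * q
    # Second loop: oldest pair to newest.
    for (s, y, rho), a in zip(zip(s_hist, y_hist, rhos), reversed(alphas)):
        b = rho * (y @ r)
        r += (a - b) * s
    return r                                      # the search direction is typically -r

# Toy check on a quadratic f(x) = 0.5 x^T A x, where the exact product is A^{-1} g.
rng = np.random.default_rng(0)
A = rng.standard_normal((8, 8))
A = A @ A.T + np.eye(8)
xs = [rng.standard_normal(8) for _ in range(5)]
gs = [A @ x for x in xs]
s_hist = [xs[i + 1] - xs[i] for i in range(4)]
y_hist = [gs[i + 1] - gs[i] for i in range(4)]
d = lbfgs_direction(gs[-1], s_hist, y_hist)
exact = np.linalg.solve(A, gs[-1])
print("cosine with exact A^{-1} g:", d @ exact / (np.linalg.norm(d) * np.linalg.norm(exact)))
```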

Extremely Low Bit Neural Network: Squeeze the Last Bit Out with ADMM

This paper focuses on compressing and accelerating deep models whose network weights are represented by very small numbers of bits, referred to as extremely low bit neural networks, and proposes to solve this problem using extragradient and iterative quantization algorithms that lead to considerably faster convergence than conventional optimization methods.

Trained Ternary Quantization

This work proposes Trained Ternary Quantization (TTQ), a method that can reduce the precision of weights in neural networks to ternary values while improving the accuracy of some models (32-, 44-, and 56-layer ResNets) on CIFAR-10 and of AlexNet on ImageNet.
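A brief sketch of the ternarization step as commonly described for TTQ: latent full-precision weights above a threshold map to a learned positive scale, those below the negative threshold map to a learned negative scale, and the rest are zeroed; the gradient routing is only outlined in comments, and the threshold fraction is an assumed default.

```python
import numpy as np

def ttq_quantize(w_latent, Wp, Wn, t=0.05):
    """Ternarize latent weights to {+Wp, 0, -Wn}; t is an assumed threshold fraction."""
    delta = t * np.abs(w_latent).max()       # threshold as a fraction of the largest weight
    pos = w_latent > delta
    neg = w_latent < -delta
    w_q = np.zeros_like(w_latent)
    w_q[pos] = Wp
    w_q[neg] = -Wn
    # During training, the loss gradient w.r.t. w_q would be (i) accumulated into Wp over
    # `pos` and into Wn over `neg`, and (ii) passed straight through to w_latent (STE).
    return w_q, pos, neg

rng = np.random.default_rng(0)
w = rng.standard_normal((4, 4))
w_q, pos, neg = ttq_quantize(w, Wp=1.0, Wn=1.0)
print(np.unique(w_q))
```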

Deep Compression: Compressing Deep Neural Network with Pruning, Trained Quantization and Huffman Coding

This work introduces "deep compression", a three-stage pipeline of pruning, trained quantization, and Huffman coding that reduces the storage requirement of neural networks by 35x to 49x without affecting their accuracy.
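A compact sketch of the first two stages of that pipeline applied to a single weight matrix, assuming magnitude pruning followed by 1-D k-means weight sharing with the linear codebook initialization the paper describes; Huffman coding of the cluster indices is omitted, and the keep ratio and codebook size are illustrative.

```python
import numpy as np

def prune_and_share(W, keep_ratio=0.3, n_clusters=16, n_iters=20):
    # 1) Magnitude pruning: zero out the smallest weights.
    thresh = np.quantile(np.abs(W), 1.0 - keep_ratio)
    mask = np.abs(W) >= thresh
    vals = W[mask]
    # 2) k-means weight sharing over the surviving weights (1-D Lloyd iterations),
    #    with linear initialization of the codebook.
    centroids = np.linspace(vals.min(), vals.max(), n_clusters)
    for _ in range(n_iters):
        assign = np.abs(vals[:, None] - centroids[None, :]).argmin(axis=1)
        for c in range(n_clusters):
            if np.any(assign == c):
                centroids[c] = vals[assign == c].mean()
    W_c = np.zeros_like(W)
    W_c[mask] = centroids[assign]
    return W_c, mask, centroids

rng = np.random.default_rng(0)
W = rng.standard_normal((64, 64))
W_c, mask, centroids = prune_and_share(W)
print("nonzero fraction:", mask.mean(), "codebook size:", centroids.size)
```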

Improving the speed of neural networks on CPUs

This paper uses speech recognition as an example task and shows that a real-time hybrid hidden Markov model / neural network (HMM/NN) large vocabulary system can be built with a 10× speedup over an unoptimized baseline and a 4× speedup over an aggressively optimized floating-point baseline at no cost in accuracy.

Binarized Neural Networks: Training Deep Neural Networks with Weights and Activations Constrained to +1 or -1

A binary matrix multiplication GPU kernel is written with which it is possible to run the MNIST BNN 7 times faster than with an unoptimized GPU kernel, without suffering any loss in classification accuracy.

Compressing Deep Convolutional Networks using Vector Quantization

This paper achieves 16-24 times compression of the network with only a 1% loss of classification accuracy using state-of-the-art CNNs, and finds that, for compressing the most storage-demanding densely connected layers, vector quantization methods have a clear gain over existing matrix factorization methods.
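The sketch below illustrates the kind of product quantization of a dense layer described here: each row of the weight matrix is split into sub-vectors, and each sub-vector is replaced by its nearest k-means codeword; the sub-vector count, codebook size, and plain Lloyd iterations are illustrative choices.

```python
import numpy as np

def product_quantize(W, n_subvectors=4, codebook_size=16, n_iters=15, seed=0):
    rng = np.random.default_rng(seed)
    out = np.empty_like(W)
    subs = np.split(W, n_subvectors, axis=1)      # assumes columns divide evenly
    for j, S in enumerate(subs):
        # k-means over the sub-vectors of this block (plain Lloyd iterations).
        C = S[rng.choice(len(S), codebook_size, replace=False)]
        for _ in range(n_iters):
            d = ((S[:, None, :] - C[None, :, :]) ** 2).sum(-1)
            a = d.argmin(axis=1)
            for k in range(codebook_size):
                if np.any(a == k):
                    C[k] = S[a == k].mean(axis=0)
        out[:, j * S.shape[1]:(j + 1) * S.shape[1]] = C[a]
    return out

rng = np.random.default_rng(1)
W = rng.standard_normal((256, 64))
W_pq = product_quantize(W)
print("relative reconstruction error:", np.linalg.norm(W - W_pq) / np.linalg.norm(W))
```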

Learning Structured Sparsity in Deep Neural Networks

The results show that for CIFAR-10, regularization on layer depth can reduce a 20-layer Deep Residual Network to 18 layers while improving the accuracy from 91.25% to 92.60%, which is still slightly higher than that of the original ResNet with 32 layers.