Low-bit Quantization of Recurrent Neural Network Language Models Using Alternating Direction Methods of Multipliers
@article{Xu2020LowbitQO,
  title   = {Low-bit Quantization of Recurrent Neural Network Language Models Using Alternating Direction Methods of Multipliers},
  author  = {Junhao Xu and Xie Chen and Shoukang Hu and Jianwei Yu and Xunying Liu and Helen M. Meng},
  journal = {ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
  year    = {2020},
  pages   = {7939-7943}
}
The high memory consumption and computational cost of recurrent neural network language models (RNNLMs) limit their wider application on resource-constrained devices. In recent years, neural network quantization techniques capable of producing extremely low-bit compression, for example binarized RNNLMs, have gained increasing research interest. Directly training quantized neural networks is difficult. By formulating quantized RNNLM training as an optimization problem, this paper…
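As a rough illustration of the general idea framed in the abstract, the sketch below alternates between gradient updates on full-precision weights, a projection onto a quantized codebook, and a dual update, which is the standard ADMM pattern for training under a quantization constraint. The toy least-squares loss, the 1-bit {-α, +α} codebook, and all hyper-parameters are assumptions made for this sketch, not details of the paper's actual RNNLM setup.

```python
# Illustrative sketch only: a generic ADMM loop for training a model with
# quantized weights. The toy loss, the binary {-alpha, +alpha} codebook and
# the hyper-parameters are assumptions for illustration.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(256, 16))                 # toy inputs
w_true = rng.normal(size=16)
y = X @ w_true + 0.01 * rng.normal(size=256)   # toy targets

def grad(w):
    """Gradient of the toy least-squares loss 0.5*||Xw - y||^2 / n."""
    return X.T @ (X @ w - y) / len(y)

def project_binary(v):
    """Project onto the 1-bit codebook {-alpha, +alpha} with alpha = mean|v|."""
    alpha = np.mean(np.abs(v))
    return alpha * np.sign(v)

w = np.zeros(16)            # full-precision weights
g = project_binary(w)       # auxiliary quantized variable
u = np.zeros(16)            # scaled dual variable
rho, lr = 1.0, 0.1

for it in range(200):
    # W-step: a few gradient steps on loss + (rho/2)*||w - g + u||^2
    for _ in range(5):
        w -= lr * (grad(w) + rho * (w - g + u))
    # G-step: project w + u onto the quantized set
    g = project_binary(w + u)
    # Dual update
    u += w - g

print("quantized weights:", g)
```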
5 Citations
Mixed Precision Low-Bit Quantization of Neural Network Language Models for Speech Recognition
- Computer Science · IEEE/ACM Transactions on Audio, Speech, and Language Processing
- 2021
Novel mixed precision neural network LM quantization techniques achieved “lossless” quantization on both tasks, producing model size compression ratios of up to approximately 16 times over the full-precision LSTM and Transformer baseline LMs while incurring no statistically significant word error rate increase.
Mixed Precision Quantization of Transformer Language Models for Speech Recognition
- Computer Science · ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
- 2021
Novel mixed precision DNN quantization methods based on Hessian trace weighted quantization perturbation and alternating direction methods of multipliers (ADMM) are used to efficiently train mixed precision quantized DNN systems.
4-bit Quantization of LSTM-based Speech Recognition Models
- Computer Science · Interspeech
- 2021
This work customizes quantization schemes depending on the local properties of the network, improving recognition performance while limiting computational time, and shows that minimal accuracy loss is achievable with an appropriate choice of quantizers and initializations.
Towards Green ASR: Lossless 4-bit Quantization of a Hybrid TDNN System on the 300-hr Switchboard Corpus
- Computer Science · Interspeech
- 2022
Experiments suggest that the proposed neural architectural compression and mixed precision quantization techniques consistently outperform the uniform precision quantised baseline systems of comparable bit-widths in terms of word error rate (WER).
Mixed Precision DNN Quantization for Overlapped Speech Separation and Recognition
- Computer Science · ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
- 2022
Novel mixed precision DNN quantization methods are proposed that apply locally variable bit-widths to individual TCN components of a TF masking based multi-channel speech separation system, with the optimal local precision settings learned automatically using three techniques.
References
SHOWING 1-10 OF 39 REFERENCES
Alternating Multi-bit Quantization for Recurrent Neural Networks
- Computer Science · ICLR
- 2018
This work quantizes the network, both weights and activations, into multiple binary codes {-1,+1} and formulates the quantization as an optimization problem; the approach achieves excellent performance on both RNNs and feedforward neural networks and extends to image classification tasks.
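As a minimal illustration of the multi-bit decomposition this summary describes, the sketch below approximates a weight vector with k binary codes, w ≈ Σᵢ αᵢ bᵢ with bᵢ ∈ {-1,+1}ⁿ. The greedy residual initialization and the single least-squares refit of the scaling factors are common choices assumed here for illustration; the cited paper alternates these steps further and also quantizes activations.

```python
# Illustrative sketch (not the paper's exact algorithm): approximate a weight
# vector with k binary codes, w ~= sum_i alpha_i * b_i, b_i in {-1,+1}^n.
import numpy as np

def multibit_quantize(w, k=2):
    n = w.size
    B = np.empty((n, k))
    residual = w.copy()
    # Greedy pass: each code takes the sign of the current residual.
    for i in range(k):
        B[:, i] = np.sign(residual)
        B[B[:, i] == 0, i] = 1.0                 # avoid zero codes
        alpha_i = np.mean(np.abs(residual))
        residual -= alpha_i * B[:, i]
    # Refit all scaling factors jointly by least squares (one alternating step).
    alphas, *_ = np.linalg.lstsq(B, w, rcond=None)
    return alphas, B

rng = np.random.default_rng(0)
w = rng.normal(size=1024)
alphas, B = multibit_quantize(w, k=2)
w_hat = B @ alphas
print("relative error:", np.linalg.norm(w - w_hat) / np.linalg.norm(w))
```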
Quantized Neural Networks: Training Neural Networks with Low Precision Weights and Activations
- Computer Science · J. Mach. Learn. Res.
- 2017
A binary matrix multiplication GPU kernel is programmed with which it is possible to run the MNIST QNN 7 times faster than with an unoptimized GPU kernel, without suffering any loss in classification accuracy.
Limited-Memory BFGS Optimization of Recurrent Neural Network Language Models for Speech Recognition
- Computer Science · 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
- 2018
A limited-memory Broyden-Fletcher-Goldfarb-Shanno (L-BFGS) based second-order optimization technique is proposed for RNNLMs that efficiently approximates the matrix-vector product between the inverse Hessian and the gradient vector via a recursion over past gradients with a compact memory requirement.
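The inverse-Hessian/gradient product mentioned in this summary is typically approximated with the standard L-BFGS two-loop recursion; a generic sketch is given below. The history layout and the initial Hessian scaling are standard textbook choices assumed here, not details taken from the cited paper.

```python
# Illustrative sketch of the standard L-BFGS two-loop recursion, the kind of
# inverse-Hessian/gradient product approximation referred to above.
import numpy as np

def lbfgs_direction(grad, s_hist, y_hist):
    """Approximate H^{-1} @ grad from the stored curvature pairs, where
    s = x_{k+1} - x_k and y = g_{k+1} - g_k, ordered oldest to newest."""
    assert s_hist, "needs at least one curvature pair"
    q = grad.copy()
    rhos = [1.0 / (y @ s) for s, y in zip(s_hist, y_hist)]
    alphas = []
    # First loop: newest to oldest
    for s, y, rho in reversed(list(zip(s_hist, y_hist, rhos))):
        a = rho * (s @ q)
        alphas.append(a)
        q -= a * y
    alphas.reverse()
    # Initial Hessian scaling gamma = (s^T y) / (y^T y) from the newest pair
    s_last, y_last = s_hist[-1], y_hist[-1]
    r = ((s_last @ y_last) / (y_last @ y_last)) * q
    # Second loop: oldest to newest
    for (s, y, rho), a in zip(zip(s_hist, y_hist, rhos), alphas):
        beta = rho * (y @ r)
        r += (a - beta) * s
    return r   # the search direction used in practice is -r
```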
Extremely Low Bit Neural Network: Squeeze the Last Bit Out with ADMM
- Computer Science · AAAI
- 2018
This paper focuses on compressing and accelerating deep models whose network weights are represented with very small numbers of bits, referred to as extremely low-bit neural networks, and proposes to solve this problem using extragradient and iterative quantization algorithms that lead to considerably faster convergence than conventional optimization methods.
Trained Ternary Quantization
- Computer Science · ICLR
- 2017
This work proposes Trained Ternary Quantization (TTQ), a method that reduces the precision of weights in neural networks to ternary values, improving the accuracy of some models (32-, 44- and 56-layer ResNets) on CIFAR-10 and of AlexNet on ImageNet.
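For illustration, the sketch below maps weights to ternary values {-Wn, 0, +Wp} using a magnitude threshold, which is the kind of quantizer TTQ builds on. The threshold factor and the use of mean absolute values as the two scales are assumptions of this sketch; TTQ itself learns Wp and Wn by gradient descent during training.

```python
# Illustrative sketch of ternary weight quantization in the spirit of TTQ.
# The threshold factor and the mean-absolute-value scales are assumptions.
import numpy as np

def ternarize(w, t=0.05):
    delta = t * np.max(np.abs(w))          # magnitude threshold
    pos, neg = w > delta, w < -delta
    wp = np.mean(w[pos]) if pos.any() else 0.0
    wn = np.mean(np.abs(w[neg])) if neg.any() else 0.0
    q = np.zeros_like(w)
    q[pos], q[neg] = wp, -wn               # map to {-Wn, 0, +Wp}
    return q

rng = np.random.default_rng(0)
w = rng.normal(size=(64, 64))
print("unique ternary levels:", np.unique(ternarize(w)))
```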
Deep Compression: Compressing Deep Neural Network with Pruning, Trained Quantization and Huffman Coding
- Computer Science · ICLR
- 2016
This work introduces “deep compression”, a three-stage pipeline of pruning, trained quantization and Huffman coding that together reduce the storage requirements of neural networks by 35x to 49x without affecting their accuracy.
Improving the speed of neural networks on CPUs
- Computer Science
- 2011
This paper uses speech recognition as an example task and shows that a real-time hybrid hidden Markov model / neural network (HMM/NN) large-vocabulary system can be built with a 10× speedup over an unoptimized baseline and a 4× speedup over an aggressively optimized floating-point baseline at no cost in accuracy.
Binarized Neural Networks: Training Deep Neural Networks with Weights and Activations Constrained to +1 or -1
- Computer Science
- 2016
A binary matrix multiplication GPU kernel is written with which it is possible to run the MNIST BNN 7 times faster than with an unoptimized GPU kernel, without suffering any loss in classification accuracy.
Compressing Deep Convolutional Networks using Vector Quantization
- Computer Science · ArXiv
- 2014
This paper achieves 16-24 times compression of the network with only 1% loss of classification accuracy using a state-of-the-art CNN, and finds that, for compressing the most storage-demanding densely connected layers, vector quantization methods have a clear gain over existing matrix factorization methods.
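As a minimal illustration of vector quantization applied to dense-layer weights, the sketch below uses the simplest k-means weight-sharing variant: every weight is replaced by its nearest of k shared centroids, so only a small codebook plus low-bit indices need to be stored. The codebook size and the plain Lloyd iterations are assumptions for this sketch; the cited paper also explores variants such as product quantization over sub-vectors.

```python
# Illustrative sketch: k-means weight sharing for a dense layer. Each weight
# is replaced by its nearest of k shared centroids (the codebook).
import numpy as np

def kmeans_quantize(W, k=16, iters=20, seed=0):
    rng = np.random.default_rng(seed)
    flat = W.ravel()
    centroids = rng.choice(flat, size=k, replace=False)
    for _ in range(iters):
        # Assign each weight to its nearest centroid, then recompute centroids.
        assign = np.argmin(np.abs(flat[:, None] - centroids[None, :]), axis=1)
        for j in range(k):
            if np.any(assign == j):
                centroids[j] = flat[assign == j].mean()
    return centroids[assign].reshape(W.shape), centroids, assign

rng = np.random.default_rng(1)
W = rng.normal(size=(128, 128))
W_q, codebook, idx = kmeans_quantize(W)
print("codebook size:", codebook.size,
      "reconstruction error:", np.linalg.norm(W - W_q) / np.linalg.norm(W))
```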
Learning Structured Sparsity in Deep Neural Networks
- Computer Science · NIPS
- 2016
The results show that for CIFAR-10, regularization on layer depth can reduce a 20-layer Deep Residual Network to 18 layers while improving the accuracy from 91.25% to 92.60%, which is still slightly higher than that of the original ResNet with 32 layers.