Corpus ID: 225094123

INT8 Winograd Acceleration for Conv1D Equipped ASR Models Deployed on Mobile Devices

Yiwu Yao, Yuchao Li, Chengyu Wang, Tianhang Yu, Houjiang Chen, Xiaotang Jiang, Jun Yang, Jun Huang, Wei Lin, Hui Shu, Chengfei Lv
The intensive computation of Automatic Speech Recognition (ASR) models obstructs their deployment on mobile devices. In this paper, we present a novel quantized Winograd optimization pipeline, which combines quantization and fast convolution to achieve efficient inference acceleration of ASR models on mobile devices. To avoid the information loss caused by combining quantization with Winograd convolution, a Range-Scaled Quantization (RSQ) training method is proposed to expand…
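The core idea — performing the Winograd element-wise multiplication in the INT8 domain — can be sketched for the smallest 1D case, F(2,3), where a 4-sample input tile and a 3-tap filter yield 2 outputs. This is an illustrative sketch, not the paper's actual pipeline; the symmetric max-based scales and the `winograd_f23_int8` helper are assumptions for demonstration.

```python
import numpy as np

# Winograd F(2,3) transform matrices (1D): 4-sample tile, 3-tap filter, 2 outputs.
BT = np.array([[1,  0, -1,  0],
               [0,  1,  1,  0],
               [0, -1,  1,  0],
               [0,  1,  0, -1]], dtype=np.float32)
G = np.array([[1.0,  0.0, 0.0],
              [0.5,  0.5, 0.5],
              [0.5, -0.5, 0.5],
              [0.0,  0.0, 1.0]], dtype=np.float32)
AT = np.array([[1, 1,  1,  0],
               [0, 1, -1, -1]], dtype=np.float32)

def quant_int8(x, scale):
    """Symmetric int8 quantization of x with the given scale."""
    return np.clip(np.round(x / scale), -127, 127).astype(np.int32)

def winograd_f23_int8(d, g):
    """Quantize the *transformed* tile and filter to int8, multiply in the
    integer domain, dequantize, then apply the output transform A^T."""
    U = G @ g                                   # transformed filter, length 4
    V = BT @ d                                  # transformed input tile, length 4
    su = max(float(np.abs(U).max()) / 127.0, 1e-8)
    sv = max(float(np.abs(V).max()) / 127.0, 1e-8)
    M = (quant_int8(U, su) * quant_int8(V, sv)).astype(np.float32) * (su * sv)
    return AT @ M                               # two output samples

d = np.array([1.0, 2.0, 3.0, 4.0], dtype=np.float32)
g = np.array([0.5, -1.0, 0.25], dtype=np.float32)
reference = np.convolve(d, g[::-1], mode="valid")  # direct 1D correlation
approx = winograd_f23_int8(d, g)                   # matches up to int8 rounding
```

The risk the RSQ method addresses is visible here: quantization error is introduced in the *transform* domain, where value ranges are inflated by the B^T and G transforms, so naive scale choices lose precision.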
1 Citation

Q-ASR: Integer-only Zero-shot Quantization for Efficient Speech Recognition
Q-ASR, an integer-only, zero-shot quantization scheme for ASR models, is proposed; it generates synthetic data whose runtime statistics resemble those of real data, which is then used to calibrate models during quantization.


References

Searching for Winograd-aware Quantized Networks
This work proposes a Winograd-aware formulation of convolution layers that exposes the numerical inaccuracies introduced by the Winograd transformations to the learning of the model parameters, enabling the design of competitive quantized models without impacting model size.
MNN: A Universal and Efficient Inference Engine
The contributions of MNN include a mechanism called pre-inference that conducts runtime optimization, thorough kernel optimization on operators to achieve optimal computation performance, and a backend abstraction module that enables hybrid scheduling and keeps the engine lightweight.
PACT: Parameterized Clipping Activation for Quantized Neural Networks
It is shown, for the first time, that both weights and activations can be quantized to 4 bits of precision while still achieving accuracy comparable to full-precision networks across a range of popular models and datasets.
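The PACT clipping function itself is simple enough to sketch: activations are clipped to [0, α] with a learnable α, then quantized uniformly to k bits. The quantization helper below is an illustrative assumption; the clipping formula is the one from the paper.

```python
import numpy as np

def pact_activation(x, alpha, k=4):
    """PACT: clip activations to [0, alpha] via the paper's formulation
    y = 0.5 * (|x| - |x - alpha| + alpha), then quantize y to k bits.
    alpha is a learnable parameter trained alongside the weights."""
    y = 0.5 * (np.abs(x) - np.abs(x - alpha) + alpha)  # == np.clip(x, 0, alpha)
    scale = alpha / (2**k - 1)
    return np.round(y / scale) * scale

x = np.array([-1.0, 0.3, 0.8, 5.0])
out = pact_activation(x, alpha=2.0, k=4)
```

Because α is a parameter rather than a fixed clip level, the network can trade off clipping error against quantization resolution during training.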
Deep Learning towards Mobile Applications
An overview of the current challenges and representative achievements in pushing deep learning onto mobile devices, from three aspects: training with mobile data, efficient inference on mobile devices, and applications of mobile deep learning.
Quantizing deep convolutional networks for efficient inference: A whitepaper
An overview of techniques for quantizing convolutional neural networks for inference with integer weights and activations is presented, and it is recommended that per-channel quantization of weights and per-layer quantization of activations be the preferred quantization scheme for hardware acceleration and kernel optimization.
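The recommended scheme can be sketched as follows — one symmetric scale per output channel for weights, a single per-tensor scale for activations. Function names are illustrative, not from the whitepaper.

```python
import numpy as np

def quantize_weights_per_channel(W, n_bits=8):
    """One symmetric scale per output channel (axis 0), the scheme
    recommended for conv/linear weights."""
    qmax = 2 ** (n_bits - 1) - 1                       # 127 for int8
    axes = tuple(range(1, W.ndim))
    scales = np.abs(W).max(axis=axes, keepdims=True) / qmax
    scales = np.maximum(scales, 1e-8)
    Wq = np.clip(np.round(W / scales), -qmax, qmax).astype(np.int8)
    return Wq, scales

def quantize_activations_per_tensor(x, n_bits=8):
    """A single symmetric scale for the whole activation tensor."""
    qmax = 2 ** (n_bits - 1) - 1
    scale = max(float(np.abs(x).max()) / qmax, 1e-8)
    xq = np.clip(np.round(x / scale), -qmax, qmax).astype(np.int8)
    return xq, scale

# A weight with wildly different per-channel ranges: per-channel scales keep
# the small channel's resolution; a single shared scale would round it away.
W = np.array([[100.0, -50.0, 25.0],
              [0.01,  0.02, -0.015]])
Wq, s = quantize_weights_per_channel(W)
W_deq = Wq.astype(np.float32) * s
```

Per-channel weight scales cost almost nothing at inference time (they fold into the output dequantization), which is why they are the preferred default.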
Weakly Supervised Construction of ASR Systems with Massive Video Data
This paper proposes an effective approach to extract high-quality audio aligned with transcripts from videos based on Optical Character Recognition (OCR), and presents a weakly supervised framework for constructing ASR systems with massive video data.
Learned Step Size Quantization
This work introduces a novel means to estimate and scale the task loss gradient with respect to the quantizer step size of each weight and activation layer, such that the step size can be learned in conjunction with other network parameters.
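The step-size gradient in question can be sketched with the paper's straight-through estimator: inside the quantization range the gradient is round(x/s) − x/s, and at the clipping points it is the clip limit itself. The helper name is an illustrative assumption.

```python
import numpy as np

def lsq_quantize_and_grad(x, s, qn=-128, qp=127):
    """LSQ quantizer xq = clip(round(x/s), qn, qp) * s, plus the
    straight-through gradient of xq w.r.t. the step size s:
    round(x/s) - x/s inside the range, qn or qp at the clipping points."""
    v = x / s
    vbar = np.clip(np.round(v), qn, qp)
    xq = vbar * s
    grad_s = np.where(v <= qn, float(qn),
             np.where(v >= qp, float(qp), vbar - v))
    return xq, grad_s

x = np.array([0.25, -0.1, 100.0])
xq, g = lsq_quantize_and_grad(x, s=0.1)
```

Exactly representable values (like −0.1 here) contribute zero gradient, while clipped values push s outward — which is what lets s settle on a range that balances clipping and rounding error.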
Data-Free Quantization Through Weight Equalization and Bias Correction
We introduce a data-free quantization method for deep neural networks that does not require fine-tuning or hyperparameter selection. It achieves near-original model performance on common computer vision tasks.
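The weight-equalization half of the method can be sketched for two fully connected layers with a ReLU between them: rescaling output channel i of layer 1 by 1/s_i and input channel i of layer 2 by s_i, with s_i = sqrt(r1_i * r2_i) / r2_i, equalizes per-channel ranges while leaving the network function unchanged, because ReLU is positive-homogeneous. Names below are illustrative.

```python
import numpy as np

def equalize_pair(W1, b1, W2):
    """Cross-layer equalization: rescale output channel i of layer 1 by 1/s_i
    and input channel i of layer 2 by s_i, so both layers end up with the
    per-channel range sqrt(r1_i * r2_i). ReLU(x/s)*s == ReLU(x) for s > 0,
    so the composed function is unchanged."""
    r1 = np.abs(W1).max(axis=1)                 # per-output-channel range, layer 1
    r2 = np.abs(W2).max(axis=0)                 # per-input-channel range, layer 2
    s = np.sqrt(r1 * r2) / np.maximum(r2, 1e-8)
    return W1 / s[:, None], b1 / s, W2 * s[None, :]

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(4, 3)), rng.normal(size=4)
W2 = rng.normal(size=(2, 4))
W1e, b1e, W2e = equalize_pair(W1, b1, W2)

x = rng.normal(size=3)
y_before = W2 @ np.maximum(W1 @ x + b1, 0.0)
y_after = W2e @ np.maximum(W1e @ x + b1e, 0.0)
```

Equalized ranges make per-tensor weight quantization far less lossy, which is how the method reaches near-original accuracy without any data.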
Fast Algorithms for Convolutional Neural Networks
  • Andrew Lavin and Scott Gray, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016
A new class of fast algorithms for convolutional neural networks is introduced using Winograd's minimal filtering algorithms, which compute minimal-complexity convolution over small tiles; this makes them fast for small filters and small batch sizes.
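The multiply savings can be sketched for the smallest case, F(2,3): computing two outputs of a 3-tap filter directly costs six multiplies, while Winograd's minimal filtering needs only four — and the filter-side factors are constants that can be precomputed. This is an illustrative sketch of the classical identity, not the paper's GPU implementation.

```python
import numpy as np

def f23_direct(d, g):
    """Direct form: 2 outputs x 3 taps = 6 multiplies."""
    return np.array([d[0]*g[0] + d[1]*g[1] + d[2]*g[2],
                     d[1]*g[0] + d[2]*g[1] + d[3]*g[2]])

def f23_winograd(d, g):
    """Winograd F(2,3): the same 2 outputs with only 4 multiplies m1..m4;
    the filter-side terms in m2 and m3 depend only on g and are precomputed."""
    m1 = (d[0] - d[2]) * g[0]
    m2 = (d[1] + d[2]) * (g[0] + g[1] + g[2]) / 2
    m3 = (d[2] - d[1]) * (g[0] - g[1] + g[2]) / 2
    m4 = (d[1] - d[3]) * g[2]
    return np.array([m1 + m2 + m3, m2 - m3 - m4])

d = np.array([1.0, 2.0, 3.0, 4.0])
g = np.array([0.5, -1.0, 0.25])
```

Larger tile sizes save more multiplies but amplify numerical error in the transforms, which is exactly the tension the quantized-Winograd pipeline above has to manage.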
Deep-FSMN for Large Vocabulary Continuous Speech Recognition
An improved feedforward sequential memory network (FSMN) architecture, namely Deep-FSMN (DFSMN), is presented by introducing skip connections between memory blocks in adjacent layers, which enable information flow across different layers and thus alleviate the gradient vanishing problem when building very deep structures.