RTMobile: Beyond Real-Time Mobile Acceleration of RNNs for Speech Recognition

Authors: Peiyan Dong, Siyue Wang, Wei Niu, Chengming Zhang, Sheng Lin, Z. Li, Yifan Gong, Bin Ren, X. Lin, Yanzhi Wang, Dingwen Tao
Venue: 2020 57th ACM/IEEE Design Automation Conference (DAC)
Automatic speech recognition based on recurrent neural networks (RNNs) has become promising and important on mobile devices such as smartphones. However, previous RNN compression techniques suffer either from hardware performance overhead due to irregularity or from significant accuracy loss due to the regularity preserved for hardware friendliness. In this work, we propose RTMobile, which leverages both a novel block-based pruning approach and compiler optimizations to accelerate RNN inference… 
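
The abstract is truncated before it describes the pruning method, so the following is only a minimal sketch of one generic form of block-based structured pruning, not the RTMobile algorithm itself. It assumes the weight matrix is split into row blocks and that, within each block, whole columns with the smallest L2 norm are zeroed. The function name `block_column_prune` and the parameters `block_rows` and `keep_ratio` are hypothetical.

```python
import math

def block_column_prune(weights, block_rows, keep_ratio):
    """Zero the weakest columns inside each row block (illustrative only).

    weights    -- dense matrix as a list of equal-length rows
    block_rows -- number of consecutive rows per block (assumed parameter)
    keep_ratio -- fraction of columns kept in every block (assumed parameter)
    """
    n_rows, n_cols = len(weights), len(weights[0])
    keep = max(1, round(n_cols * keep_ratio))
    pruned = [row[:] for row in weights]
    for start in range(0, n_rows, block_rows):
        block = range(start, min(start + block_rows, n_rows))
        # L2 norm of each column, restricted to the rows of this block
        norms = [math.sqrt(sum(weights[r][c] ** 2 for r in block))
                 for c in range(n_cols)]
        ranked = sorted(range(n_cols), key=lambda c: norms[c], reverse=True)
        kept = set(ranked[:keep])
        for r in block:
            for c in range(n_cols):
                if c not in kept:
                    pruned[r][c] = 0.0
    return pruned

W = [[0.9, 0.1, -0.8, 0.05],
     [0.7, -0.2, 0.6, 0.0],
     [0.1, 0.5, 0.02, -0.9],
     [-0.05, 0.4, 0.1, 0.8]]
W_pruned = block_column_prune(W, block_rows=2, keep_ratio=0.5)
```

In this sketch each block keeps a different set of columns, so the surviving weights remain regular inside a block while the choice of columns can vary between blocks. That is one way to sit between fully irregular and fully structured sparsity, the trade-off the abstract refers to.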

Puncturing the memory wall: Joint optimization of network compression with approximate memory for ASR application

A jointly optimized scheme of network compression with approximate memory for an economical ASR system, including a novel pruning technique coordinated with low-precision quantization and the approximate memory scheme, and an ASR-adapted incremental retraining method to further obtain optimal power saving.

Automatic Mapping of the Best-Suited DNN Pruning Schemes for Real-Time Mobile Acceleration

A general, fine-grained structured pruning scheme and corresponding compiler optimizations that are applicable to any type of DNN layer while achieving high accuracy and hardware inference performance. Results demonstrate that these methods outperform the state-of-the-art DNN optimization framework.

Compiler-Aware Neural Architecture Search for On-Mobile Real-time Super-Resolution

A compiler-aware SR neural architecture search (NAS) framework that conducts depth search and per-layer width search with adaptive SR blocks to achieve real-time SR inference at 720p resolution with competitive SR performance on the GPU/DSP of mobile platforms.

GRIM: A General, Real-Time Deep Learning Inference Framework for Mobile Devices Based on Fine-Grained Structured Weight Sparsity

  • Wei Niu, Zhengang …, Bin Ren
  • Computer Science
    IEEE Transactions on Pattern Analysis and Machine Intelligence
  • 2022
This paper designs a novel mobile inference acceleration framework, GRIM, that is General to both convolutional neural networks (CNNs) and recurrent neural networks (RNNs), and that achieves Real-time execution and high accuracy, leveraging fine-grained structured sparse model Inference and compiler optimizations for Mobiles.

Squeeze for Sneeze: Compact Neural Networks for Cold and Flu Recognition

Key results indicate that pruning and then quantising a network can reduce the number of operational weights by almost 90% and the overall size of the network, as measured in MB, by almost 95%, without affecting overall recognition performance.
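
The two figures above are consistent with a simple size model, sketched below as back-of-the-envelope arithmetic. The bit widths (32-bit weights quantised to 16 bits) are an assumption chosen for illustration and are not stated in the summary. The model also ignores any index overhead of a sparse storage format, and the function name is hypothetical.

```python
def compressed_size_fraction(weight_keep_fraction, bits_before, bits_after):
    """Fraction of the original model size left after pruning and quantising.

    Assumes size is proportional to (number of weights) x (bits per weight)
    and ignores sparse-index overhead. Illustrative model only.
    """
    return weight_keep_fraction * bits_after / bits_before

# Keeping 10% of the weights (a 90% reduction) and halving the bit width
# (assumed 32 -> 16 bits) leaves about 5% of the original size.
fraction = compressed_size_fraction(0.10, bits_before=32, bits_after=16)
size_reduction = 1 - fraction
```

Under these assumed bit widths the size reduction comes out at about 95%, which matches the order of magnitude reported, though the paper's actual quantisation settings may differ.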

I3D: Transformer architectures with input-dependent dynamic depth for speech recognition

A novel Transformer encoder with Input-Dependent Dynamic Depth (I3D) that achieves strong performance-efficiency trade-offs, together with an analysis of the gate probabilities and the input dependency, which helps to better understand deep encoders.

6.7ms on Mobile with over 78% ImageNet Accuracy: Unified Network Pruning and Architecture Search for Beyond Real-Time Mobile Acceleration

This work proposes a general category of fine-grained structured pruning applicable to various DNN layers, and a comprehensive compiler framework for automatic code generation that supports different DNNs and different pruning schemes, bridging the gap between model compression and NAS.

Achieving on-Mobile Real-Time Super-Resolution with Neural Architecture and Pruning Search

The proposed framework is the first to achieve real-time SR inference (with only tens of milliseconds per frame) at 720p resolution with competitive image quality (in terms of PSNR and SSIM) on mobile platforms (Samsung Galaxy S20).

DNNFusion: accelerating deep neural networks execution with advanced operator fusion

The basic idea of this work is to operate at the operator level of DNNs but to expand fusion opportunities by developing a classification of both individual operators and their combinations. It also includes a novel mathematical-property-based graph rewriting framework to reduce evaluation costs and facilitate subsequent operator fusion.
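
As a rough illustration of what operator fusion buys, the sketch below fuses a chain of element-wise operators into a single pass so that no intermediate tensors are materialised. It is a generic example under simplifying assumptions: it does not reproduce DNNFusion's operator classification or its graph rewriting, and all names are hypothetical.

```python
def run_unfused(ops, data):
    """Apply each operator in turn; every step builds an intermediate list."""
    for op in ops:
        data = [op(v) for v in data]
    return data

def fuse(ops):
    """Combine a chain of element-wise operators into one per-element kernel."""
    def fused(v):
        for op in ops:
            v = op(v)
        return v
    return fused

def run_fused(ops, data):
    """Single pass over the data; no intermediate lists are created."""
    kernel = fuse(ops)
    return [kernel(v) for v in data]

# Hypothetical operator chain: scale, bias, ReLU.
ops = [lambda v: v * 2, lambda v: v + 1, lambda v: max(v, 0)]
data = [-2, 0, 3]
unfused_out = run_unfused(ops, data)
fused_out = run_fused(ops, data)
```

Both paths compute the same values; the fused version reads and writes each element once, which is the memory-traffic saving that fusion targets. Fusing operators that are not element-wise needs more care, which is where a classification of operator types becomes useful.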



ESE: Efficient Speech Recognition Engine with Sparse LSTM on FPGA

This work proposes a load-balance-aware pruning method that can compress the LSTM model size by 20x (10x from pruning and 2x from quantization) with negligible loss of prediction accuracy, and a scheduler that encodes and partitions the compressed model across multiple PEs for parallelism and schedules the complicated LSTM data flow.

E-RNN: Design Optimization for Efficient Recurrent Neural Networks in FPGAs

  • Zhe Li, Caiwen Ding, … Yanzhi Wang
  • Computer Science
    2019 IEEE International Symposium on High Performance Computer Architecture (HPCA)
  • 2019
The Efficient RNN (E-RNN) framework is presented, and the alternating direction method of multipliers (ADMM) technique is used for more accurate block-circulant training, and two design explorations providing guidance on block size and reducing RNN training trials are presented.
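
For readers unfamiliar with block-circulant weight matrices, here is a minimal sketch of the matrix-vector product they enable. Each b x b block is circulant and is therefore stored as a single length-b vector (taken here to be its first column), which reduces storage per block from b² values to b. The layout and function names are assumptions for illustration; practical implementations typically use FFTs instead of the direct sum shown here, and this does not reproduce E-RNN's ADMM training.

```python
def circulant_matvec(c, x):
    """Multiply the circulant matrix with first column c by vector x.

    Entry (i, j) of the matrix is c[(i - j) % n].
    """
    n = len(c)
    return [sum(c[(i - j) % n] * x[j] for j in range(n)) for i in range(n)]

def block_circulant_matvec(blocks, x, b):
    """blocks[p][q] is the defining vector of circulant block (p, q)."""
    y = []
    for block_row in blocks:
        acc = [0] * b
        for q, c in enumerate(block_row):
            part = circulant_matvec(c, x[q * b:(q + 1) * b])
            acc = [a + v for a, v in zip(acc, part)]
        y.extend(acc)
    return y

# One block row with two 2x2 circulant blocks: a 2x4 weight matrix
# stored with 4 values instead of 8.
blocks = [[[1, 2], [0, 1]]]
y = block_circulant_matvec(blocks, [1, 0, 2, 3], b=2)
```

The block size b controls the trade-off: larger blocks compress more but constrain the weights more tightly, which is presumably why guidance on choosing the block size is part of the design exploration.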

FPGA-based accelerator for long short-term memory recurrent neural networks

This work presents an FPGA-based accelerator for LSTM-RNNs that optimizes both computation performance and communication requirements and significantly outperforms previous approaches.

FPGA Acceleration of Recurrent Neural Network Based Language Model

This work presents an FPGA implementation framework for accelerating RNNLM training that improves the parallelism of the RNN training scheme and reduces computing resource requirements to enhance computation efficiency.

C-LSTM: Enabling Efficient LSTM using Structured Compression Techniques on FPGAs

This work proposes a comprehensive framework called C-LSTM to automatically optimize and implement a wide range of LSTM variants on FPGAs and achieves up to 18.8X and 33.5X gains for performance and energy efficiency compared with the state-of-the-art LSTM implementation under the same experimental setup.

Acceleration of LSTM With Structured Pruning Method on FPGA

A structured pruning method is proposed that not only reduces the LSTM model's size without loss of prediction accuracy but also eliminates imbalanced computation and irregular memory accesses, speeding up inference on FPGA.

Accelerating recurrent neural networks in analytics servers: Comparison of FPGA, CPU, GPU, and ASIC

This paper proposes a memoization optimization that avoids 3 of the 6 dense matrix-vector multiplications (SGEMVs) that make up the majority of the computation in a GRU, and studies the opportunities to accelerate the remaining SGEMVs using FPGAs, in comparison to a 14-nm ASIC, a GPU, and a multi-core CPU.

DeepMon: Mobile GPU-based Deep Learning Framework for Continuous Vision Applications

This paper proposes DeepMon, a mobile deep learning inference system to run a variety of deep learning inferences purely on a mobile device in a fast and energy-efficient manner and designs a suite of optimization techniques to efficiently offload convolutional layers to mobile GPUs and accelerate the processing.

Efficient and Effective Sparse LSTM on FPGA with Bank-Balanced Sparsity

This work presents Bank-Balanced Sparsity (BBS), a novel sparsity pattern that can maintain model accuracy at a high sparsity level while still enabling an efficient FPGA implementation, and proposes a decoding-free sparse matrix format, Compressed Sparse Banks (CSB), that transparently exposes inter-bank parallelism in BBS to hardware.
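
A minimal sketch of the bank-balanced idea, assuming each weight row is split into equal-sized banks and the same number of largest-magnitude weights is kept in every bank. The function name and parameters are hypothetical, and the CSB storage format is not shown.

```python
def bank_balanced_prune(row, num_banks, keep_per_bank):
    """Keep the keep_per_bank largest-magnitude weights in every bank.

    Assumes len(row) is divisible by num_banks (illustrative restriction).
    """
    if len(row) % num_banks != 0:
        raise ValueError("row length must be divisible by num_banks")
    bank_size = len(row) // num_banks
    pruned = [0.0] * len(row)
    for b in range(num_banks):
        start = b * bank_size
        indices = range(start, start + bank_size)
        # Rank the positions in this bank by absolute weight, largest first
        top = sorted(indices, key=lambda i: abs(row[i]), reverse=True)
        for i in top[:keep_per_bank]:
            pruned[i] = row[i]
    return pruned

row = [0.8, -0.1, 0.2, 1.5, 0.4, -0.9, 0.3, 0.05]
balanced = bank_balanced_prune(row, num_banks=2, keep_per_bank=2)
```

Because every bank ends up with the same number of non-zeros, parallel hardware lanes that each process one bank receive equal work, while the choice of which weights survive inside a bank stays unconstrained.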

TVM: An Automated End-to-End Optimizing Compiler for Deep Learning

TVM is a compiler that exposes graph-level and operator-level optimizations to provide performance portability to deep learning workloads across diverse hardware back-ends and automates optimization of low-level programs to hardware characteristics by employing a novel, learning-based cost modeling method for rapid exploration of code optimizations.