PatDNN: Achieving Real-Time DNN Execution on Mobile Devices with Pattern-based Weight Pruning

@article{Niu2020PatDNNAR,
  title={PatDNN: Achieving Real-Time DNN Execution on Mobile Devices with Pattern-based Weight Pruning},
  author={Wei Niu and Xiaolong Ma and Sheng Lin and Shihao Wang and Xuehai Qian and X. Lin and Yanzhi Wang and Bin Ren},
  journal={Proceedings of the Twenty-Fifth International Conference on Architectural Support for Programming Languages and Operating Systems},
  year={2020}
}
With the emergence of a spectrum of high-end mobile devices, many applications that formerly required desktop-level computation capability are being transferred to these devices. However, executing Deep Neural Network (DNN) inference is still challenging given its high computation and storage demands, especially when real-time performance with high accuracy is needed. Weight pruning of DNNs is proposed, but existing schemes represent two extremes in the design space: non-structured…
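The core technique named in the title, pattern-based weight pruning, constrains each 3x3 convolution kernel to one of a small set of predefined non-zero patterns and additionally removes whole kernels (connectivity pruning), so the remaining sparsity stays regular enough for compiler optimization. The sketch below is a minimal illustration only: the 4-entry pattern set, magnitude-based pattern selection, and one-shot pruning are my assumptions, whereas the actual PatDNN work designs its pattern set carefully and retrains the network (e.g., with ADMM-based regularization).

# Minimal sketch of pattern-based kernel pruning (illustrative only; the
# pattern set, magnitude-based selection, and one-shot pruning below are
# assumptions, not the paper's exact procedure).
import numpy as np

# Each pattern lists the (row, col) positions kept inside a 3x3 kernel.
PATTERNS = [
    [(0, 1), (1, 0), (1, 1), (1, 2)],
    [(1, 0), (1, 1), (1, 2), (2, 1)],
    [(0, 1), (1, 0), (1, 1), (2, 1)],
    [(0, 1), (1, 1), (1, 2), (2, 1)],
]

def pattern_masks():
    masks = np.zeros((len(PATTERNS), 3, 3), dtype=np.float32)
    for i, pattern in enumerate(PATTERNS):
        for r, c in pattern:
            masks[i, r, c] = 1.0
    return masks

def prune_conv_layer(weights, connectivity_keep=0.75):
    """weights: (out_ch, in_ch, 3, 3) -> pattern- and connectivity-pruned copy."""
    masks = pattern_masks()
    pruned = np.zeros_like(weights)
    for o in range(weights.shape[0]):
        for i in range(weights.shape[1]):
            kernel = weights[o, i]
            # Keep the pattern that preserves the most weight magnitude.
            scores = [float((np.abs(kernel) * m).sum()) for m in masks]
            pruned[o, i] = kernel * masks[int(np.argmax(scores))]
    # Connectivity pruning: zero out whole kernels with the smallest norms.
    norms = np.abs(pruned).sum(axis=(2, 3))
    threshold = np.quantile(norms, 1.0 - connectivity_keep)
    pruned[norms < threshold] = 0.0
    return pruned

if __name__ == "__main__":
    w = np.random.randn(16, 8, 3, 3).astype(np.float32)
    p = prune_conv_layer(w)
    nonzeros = (np.abs(p) > 0).sum(axis=(2, 3))
    print("per-kernel non-zero counts:", np.unique(nonzeros))  # expect {0, 4}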
Automatic Mapping of the Best-Suited DNN Pruning Schemes for Real-Time Mobile Acceleration
TLDR
A general, fine-grained structured pruning scheme and corresponding compiler optimizations are proposed that are applicable to any type of DNN layer while achieving high accuracy and hardware inference performance, together with a method to automatically derive the best-suited pruning regularity and block size for each layer of any given DNN.
Compiler-Aware Neural Architecture Search for On-Mobile Real-time Super-Resolution
TLDR
A compiler-aware SR neural architecture search (NAS) framework that conducts depth search and per-layer width search with adaptive SR blocks, achieving real-time SR inference at 720p resolution with competitive SR performance on the GPU/DSP of mobile platforms (Samsung Galaxy S21).
NPS: A Compiler-aware Framework of Unified Network Pruning for Beyond Real-Time Mobile Acceleration
TLDR
A general category of fine-grained structured pruning applicable to various DNN layers is proposed, along with a comprehensive compiler-based automatic code generation framework supporting different DNNs and different pruning schemes, bridging the gap between model compression and NAS.
An Application-Oblivious Memory Scheduling System for DNN Accelerators
TLDR
The memory pressure issues of DNN training are reviewed from the perspective of runtime systems, the iterative, regularity, and extremalization properties of memory access patterns in DNN workloads are identified, and an application-oblivious memory scheduling system, AppObMem, is proposed.
Achieving on-Mobile Real-Time Super-Resolution with Neural Architecture and Pruning Search
TLDR
The proposed framework is the first to achieve real-time SR inference (with only tens of milliseconds per frame) at 720p resolution with competitive image quality (in terms of PSNR and SSIM) on mobile platforms (Samsung Galaxy S20).
Understanding and Optimizing Deep Learning Cold-Start Latency on Edge Devices
TLDR
NNV12 is presented, the first on-device inference engine that optimizes for cold inference and employs a heuristic-based scheme to obtain a near-optimal kernel scheduling plan.
YOLObile: Real-Time Object Detection on Mobile Devices via Compression-Compilation Co-Design
TLDR
This work proposes the YOLObile framework, real-time object detection on mobile devices via compression-compilation co-design, and introduces a novel block-punched pruning scheme for any kernel size.
Hardware-friendly User-specific Machine Learning for Edge Devices
TLDR
This work presents a hardware-friendly, lightweight pruning technique to create user-specific models directly on mobile platforms, while simultaneously executing inferences, and proposes architectural support to prune user-specific models on a systolic edge ML inference accelerator.
NPAS: A Compiler-aware Framework of Unified Network Pruning and Architecture Search for Beyond Real-Time Mobile Acceleration
  • Zhengang Li, Geng Yuan, Xue Lin
  • Computer Science
    2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
  • 2021
TLDR
A general category of fine-grained structured pruning applicable to various DNN layers is proposed, along with a comprehensive compiler-based automatic code generation framework supporting different DNNs and different pruning schemes, bridging the gap between model compression and NAS.
CoCoPIE: Making Mobile AI Sweet As PIE -Compression-Compilation Co-Design Goes a Long Way
TLDR
This article maintains that with effective compression-compiler co-design, it is possible to enable real-time artificial intelligence on mainstream end devices without special hardware support.
...
...

References

SHOWING 1-10 OF 74 REFERENCES
PCONV: The Missing but Desirable Sparsity in DNN Weight Pruning for Real-time Execution on Mobile Devices
TLDR
PCONV, comprising a new sparsity dimension of fine-grained pruning patterns inside coarse-grained structures, is introduced, with two types of sparsity: Sparse Convolution Patterns (SCP) and connectivity sparsity generated from inter-convolution kernel pruning.
DeftNN: Addressing Bottlenecks for DNN Execution on GPUs via Synapse Vector Elimination and Near-compute Data Fission
TLDR
DeftNN is a GPU DNN execution framework that targets the key architectural bottlenecks of DNNs on GPUs to automatically and transparently improve execution performance, and is composed of two novel optimization techniques: synapse vector elimination, which identifies non-contributing synapses in the DNN and carefully transforms data to remove the associated computation and data movement, and near-compute data fission.
TVM: An Automated End-to-End Optimizing Compiler for Deep Learning
TLDR
TVM is a compiler that exposes graph-level and operator-level optimizations to provide performance portability to deep learning workloads across diverse hardware back-ends and automates optimization of low-level programs to hardware characteristics by employing a novel, learning-based cost modeling method for rapid exploration of code optimizations.
MCDNN: An Approximation-Based Execution Framework for Deep Stream Processing Under Resource Constraints
TLDR
This work describes how several common DNNs, when subjected to state-of-the-art optimizations, trade off accuracy for resource use such as memory, computation, and energy, and introduces two new and powerful DNN optimizations that exploit this trade-off.
DeepX: A Software Accelerator for Low-Power Deep Learning Inference on Mobile Devices
  • N. Lane, S. Bhattacharya, F. Kawsar
  • Computer Science
    2016 15th ACM/IEEE International Conference on Information Processing in Sensor Networks (IPSN)
  • 2016
TLDR
Experiments show that DeepX can allow even large-scale deep learning models to execute efficiently on modern mobile processors and significantly outperform existing solutions, such as cloud-based offloading.
On-Demand Deep Model Compression for Mobile Devices: A Usage-Driven Model Selection Framework
TLDR
A usage-driven selection framework, referred to as AdaDeep, is developed to automatically select a combination of compression techniques for a given DNN that leads to an optimal balance between user-specified performance goals and resource constraints.
AMC: AutoML for Model Compression and Acceleration on Mobile Devices
TLDR
This paper proposes AutoML for Model Compression (AMC), which leverages reinforcement learning to efficiently sample the design space, improving model compression quality and achieving state-of-the-art compression results in a fully automated way without any human effort.
Deep Learning on Mobile Devices - A Review
TLDR
Hardware architectures for mobile deep learning, including Field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), and recent mobile Graphics Processing Units (GPUs), are discussed, and Size, Weight, Area and Power (SWAP) considerations and their relation to algorithm optimizations are presented.
DeepCache: Principled Cache for Mobile Deep Vision
TLDR
The implementation of DeepCache works with unmodified deep learning models, requires no manual developer effort, and is therefore immediately deployable on off-the-shelf mobile devices.
BinaryConnect: Training Deep Neural Networks with binary weights during propagations
TLDR
BinaryConnect is introduced, a method that trains a DNN with binary weights during the forward and backward propagations while retaining the precision of the stored weights in which gradients are accumulated; near state-of-the-art results with BinaryConnect are obtained on the permutation-invariant MNIST, CIFAR-10, and SVHN.
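The mechanism summarized above is concrete enough to sketch: binary weights are used for the forward and backward passes, while updates are accumulated in full-precision stored weights. The toy NumPy example below is my own linear-regression illustration under those assumptions, not the authors' code or experimental setup.

# Toy illustration of the BinaryConnect update rule (assumed details: a
# linear model, deterministic sign binarization, plain SGD).
import numpy as np

rng = np.random.default_rng(0)
W_real = rng.normal(scale=0.1, size=(4, 3))     # full-precision weights (gradient accumulator)
x = rng.normal(size=(8, 3))                     # toy input batch
y = rng.normal(size=(8, 4))                     # toy regression targets
lr = 0.01

for step in range(200):
    W_bin = np.where(W_real >= 0, 1.0, -1.0)    # binarize to {-1, +1} for propagation
    pred = x @ W_bin.T                          # forward pass uses binary weights
    err = pred - y
    grad = err.T @ x / len(x)                   # gradient w.r.t. the (binarized) weights
    W_real -= lr * grad                         # accumulate the update in full precision
    W_real = np.clip(W_real, -1.0, 1.0)         # clip, as the paper does, to keep weights bounded

print("final MSE:", float((err ** 2).mean()))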
...
...