Automatic Mapping of the Best-Suited DNN Pruning Schemes for Real-Time Mobile Acceleration

@article{gong2021automatic,
  title={Automatic Mapping of the Best-Suited DNN Pruning Schemes for Real-Time Mobile Acceleration},
  author={Yifan Gong and Geng Yuan and Zheng Zhan and Wei Niu and Zhengang Li and Pu Zhao and Yuxuan Cai and Sijia Liu and Bin Ren and Xue Lin and Xulong Tang and Yanzhi Wang},
  journal={ACM Transactions on Design Automation of Electronic Systems (TODAES)},
  year={2021}
}
  • Published 22 November 2021
  • Computer Science
Weight pruning is an effective model compression technique to tackle the challenges of achieving real-time deep neural network (DNN) inference on mobile devices. However, prior pruning schemes have limited application scenarios due to accuracy degradation, difficulty in leveraging hardware acceleration, and/or restriction on certain types of DNN layers. In this paper, we propose a general, fine-grained structured pruning scheme and corresponding compiler optimizations that are applicable to any… 
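The fine-grained structured pruning the abstract refers to can be illustrated with a minimal sketch (a hypothetical toy, not the paper's actual scheme or granularity): sparsity is imposed on small blocks of a weight matrix by magnitude, which preserves hardware-friendly regularity while pruning at a finer granularity than whole filters or channels.

```python
import numpy as np

def block_prune(weights, block=(2, 2), keep_ratio=0.5):
    """Zero out the lowest-magnitude blocks of a 2-D weight matrix.

    Toy illustration of fine-grained structured pruning: sparsity is
    decided per block rather than per individual weight (unstructured)
    or per whole row/filter (coarse-grained structured).
    """
    rows, cols = weights.shape
    br, bc = block
    assert rows % br == 0 and cols % bc == 0
    # View the matrix as a grid of (br x bc) blocks and score each
    # block by its L1 norm.
    grid = weights.reshape(rows // br, br, cols // bc, bc)
    scores = np.abs(grid).sum(axis=(1, 3))          # one score per block
    k = int(scores.size * keep_ratio)               # number of blocks kept
    threshold = np.sort(scores, axis=None)[-k]
    mask = (scores >= threshold)[:, None, :, None]  # broadcast over block dims
    return (grid * mask).reshape(rows, cols)

rng = np.random.default_rng(0)
w = rng.normal(size=(4, 4))
pruned = block_prune(w, block=(2, 2), keep_ratio=0.5)
# Half of the 2x2 blocks are now entirely zero; the rest are unchanged.
```

Because whole blocks are zeroed, the surviving weights keep a regular layout that a compiler can exploit for compact storage and vectorized loads, which is the motivation for structured (rather than random, unstructured) sparsity.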


PatDNN: Achieving Real-Time DNN Execution on Mobile Devices with Pattern-based Weight Pruning
The proposed PatDNN is an end-to-end framework that efficiently executes DNNs on mobile devices with the help of a novel model compression technique---pattern-based pruning based on an extended ADMM solution framework---and a set of thorough architecture-aware compiler/code-generation optimizations, i.e., filter kernel reordering, compressed weight storage, register load redundancy elimination, and parameter auto-tuning.
PCONV: The Missing but Desirable Sparsity in DNN Weight Pruning for Real-time Execution on Mobile Devices
PCONV introduces a new sparsity dimension, fine-grained pruning patterns inside coarse-grained structures, comprising two types of sparsity: Sparse Convolution Patterns (SCP) and connectivity sparsity generated from inter-convolution-kernel pruning.
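The pattern-sparsity idea can be sketched in a few lines (the pattern set below is illustrative only; PCONV's actual pattern library differs): each 3x3 kernel is assigned whichever predefined pattern preserves the most weight magnitude, then pruned to that pattern.

```python
import numpy as np

# Two illustrative 3x3 patterns with 4 nonzero entries each
# (hypothetical examples, not PCONV's real pattern library).
PATTERNS = [
    np.array([[1, 1, 0], [1, 1, 0], [0, 0, 0]]),  # corner cluster
    np.array([[0, 1, 0], [1, 1, 1], [0, 0, 0]]),  # T-like shape
]

def apply_best_pattern(kernel):
    """Prune a 3x3 kernel to the pattern that keeps the most L1 magnitude."""
    scores = [np.abs(kernel * p).sum() for p in PATTERNS]
    best = PATTERNS[int(np.argmax(scores))]
    return kernel * best

k = np.arange(9, dtype=float).reshape(3, 3)
pruned = apply_best_pattern(k)
# pruned == [[0, 1, 0], [3, 4, 5], [0, 0, 0]] -- the T-like pattern wins here.
```

Since every surviving kernel matches one of a small number of known shapes, the compiler can generate a specialized code path per pattern instead of handling arbitrary sparsity, which is what makes this sparsity "desirable" for real-time execution.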
Achieving Real-Time Execution of Transformer-based Large-scale Models on Mobile with Compiler-aware Neural Architecture Optimization
This paper proposes the first compiler-aware neural architecture optimization framework (called CANAO), which guarantees that the identified model meets both the resource and real-time specifications of mobile devices, thus achieving real-time execution of large transformer-based models like BERT variants.
Achieving on-Mobile Real-Time Super-Resolution with Neural Architecture and Pruning Search
The proposed framework is the first to achieve real-time SR inference (only tens of milliseconds per frame) at 720p resolution with competitive image quality (in terms of PSNR and SSIM) on mobile platforms (Samsung Galaxy S20).
Non-structured DNN Weight Pruning Considered Harmful
This paper builds ADMM-NN-S by extending a recently proposed joint weight pruning and quantization framework, and develops a methodology for fair and fundamental comparison of non-structured and structured pruning in terms of both storage and computation efficiency, concluding that non-structured pruning is considered harmful.
BLK-REW: A Unified Block-based DNN Pruning Framework using Reweighted Regularization Method
A new block-based pruning framework is proposed that comprises a general and flexible structured pruning dimension as well as a powerful and efficient reweighted regularization method, achieving universal coverage for both CNNs and RNNs with real-time mobile acceleration and no accuracy compromise.
AMC: AutoML for Model Compression and Acceleration on Mobile Devices
This paper proposes AutoML for Model Compression (AMC), which leverages reinforcement learning to efficiently sample the design space, improving model compression quality and achieving state-of-the-art compression results in a fully automated way without any human effort.
MCDNN: An Approximation-Based Execution Framework for Deep Stream Processing Under Resource Constraints
This work describes how several common DNNs, when subjected to state-of-the-art optimizations, trade off accuracy for resource use such as memory, computation, and energy, and introduces two new and powerful DNN optimizations that exploit this trade-off.
ADMM-NN: An Algorithm-Hardware Co-Design Framework of DNNs Using Alternating Direction Methods of Multipliers
ADMM-NN is the first algorithm-hardware co-optimization framework of DNNs using Alternating Direction Method of Multipliers (ADMM), a powerful technique to solve non-convex optimization problems with possibly combinatorial constraints, resulting in higher performance in model compression than the state-of-the-art.
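The ADMM approach to pruning can be sketched as a three-step iteration (a minimal, hypothetical sketch of the general ADMM splitting for cardinality-constrained pruning, not ADMM-NN's implementation): a gradient step on the regularized loss, a Euclidean projection onto the sparsity constraint set, and a dual update.

```python
import numpy as np

def project_sparsity(w, k):
    """Euclidean projection onto {w : at most k nonzeros}:
    keep the k largest-magnitude entries, zero the rest."""
    assert 0 < k <= w.size
    flat = w.flatten()
    idx = np.argsort(np.abs(flat))[:-k]  # indices of the smallest entries
    out = flat.copy()
    out[idx] = 0.0
    return out.reshape(w.shape)

def admm_prune_step(w, z, u, rho, grad_loss, lr, k):
    """One ADMM iteration for cardinality-constrained pruning (sketch).

    Splits "minimize loss(w) s.t. w is k-sparse" into:
      w-update: gradient step on loss + (rho/2)||w - z + u||^2
      z-update: projection of (w + u) onto the k-sparse set
      u-update: dual ascent on the constraint w = z
    The combinatorial constraint is handled entirely by the easy
    projection in the z-update, which is the appeal of ADMM here.
    """
    w = w - lr * (grad_loss(w) + rho * (w - z + u))  # proximal gradient step
    z = project_sparsity(w + u, k)                   # analytic projection
    u = u + w - z                                    # dual update
    return w, z, u

# Toy usage: prune a 4-weight "model" with a quadratic loss toward a target.
target = np.array([1.0, 2.0, 3.0, 4.0])
w = np.zeros(4); z = np.zeros(4); u = np.zeros(4)
for _ in range(50):
    w, z, u = admm_prune_step(w, z, u, rho=1.0,
                              grad_loss=lambda x: x - target, lr=0.1, k=2)
# z converges toward a 2-sparse solution concentrated on the largest targets.
```

In practice the w-update is ordinary SGD training with an extra quadratic term, so existing training pipelines can be reused unchanged, which is why ADMM-based pruning scales to full-size DNNs.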
TVM: An Automated End-to-End Optimizing Compiler for Deep Learning
TVM is a compiler that exposes graph-level and operator-level optimizations to provide performance portability to deep learning workloads across diverse hardware back-ends and automates optimization of low-level programs to hardware characteristics by employing a novel, learning-based cost modeling method for rapid exploration of code optimizations.