• Corpus ID: 231648305

SparseDNN: Fast Sparse Deep Learning Inference on CPUs

  • Ziheng Wang
  • Published 20 January 2021
  • Computer Science
  • ArXiv
The last few years have seen gigantic leaps in algorithms and systems to support efficient deep learning inference. Pruning and quantization algorithms can now consistently compress neural networks by an order of magnitude. For a compressed neural network, a multitude of inference frameworks have been designed to maximize the performance of the target hardware. While we find mature support for quantized neural networks in production frameworks such as OpenVINO and MNN, support for pruned sparse… 


ZeroFL: Efficient On-Device Training for Federated Learning with Local Sparsity

This work presents the first study on the unique aspects that arise when introducing sparsity at training time in FL workloads and proposes ZeroFL, a framework that relies on highly sparse operations to accelerate on-device training.

Enable Deep Learning on Mobile Devices: Methods, Systems, and Applications

This article provides an overview of efficient deep learning methods, systems, and applications by introducing popular model compression methods, including pruning, factorization, quantization, as well as compact model design.

Towards efficient vision transformer inference: a first study of transformers on mobile devices

This study profiles representative vision transformers to understand their inference performance on commercial mobile devices and the reasons behind it, and studies multi-dimensional DNN acceleration approaches to minimize latency, showing that vision transformer inference is still too expensive on mobile devices.

Two sparsities are better than one: unlocking the performance benefits of sparse–sparse networks

This article demonstrates that high performance can be achieved when running weight-sparse networks, and suggests that combining weight and activation sparsity can be a potent approach for efficiently scaling future AI models.

Deep Noise Suppression for Real Time Speech Enhancement in a Single Channel Wide Band Scenario

This work uses a deep-learning-based approach to expand on two previously proposed architectures in the context of the Deep Noise Suppression Challenge carried out by Microsoft, and proposes variants that outperform the previously defined models in terms of denoising performance, complexity, and real-time efficiency.

SparseRT: Accelerating Unstructured Sparsity on GPUs for Deep Learning Inference

SparseRT is presented, a code generator that leverages unstructured sparsity to accelerate sparse linear algebra operations in deep learning inference on GPUs, showing speedups of over 5x on use cases in ResNet-50.

Sparse GPU Kernels for Deep Learning

This work develops high-performance GPU kernels for two sparse matrix operations widely applicable in neural networks: sparse matrix–dense matrix multiplication and sampled dense–dense matrix multiplication.
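The two operations named above can be illustrated in a few lines of NumPy/SciPy. This is a minimal sketch for clarity, not the paper's GPU implementation: SpMM multiplies a sparse matrix by a dense one, while SDDMM computes a dense–dense product only at the nonzero positions of a sparse mask.

```python
import numpy as np
from scipy.sparse import random as sparse_random

rng = np.random.default_rng(0)

# SpMM: C = A_sparse @ B_dense
A = sparse_random(64, 128, density=0.1, format="csr", random_state=0)
B = rng.standard_normal((128, 32))
C = A @ B  # SciPy dispatches to a CSR SpMM routine

# SDDMM: compute (X @ Y) only where the sparse mask has nonzeros.
X = rng.standard_normal((64, 16))
Y = rng.standard_normal((16, 128))
S = A.tocoo()  # reuse A's sparsity pattern as the mask
# One dot product per nonzero position (row i of X with column j of Y)
vals = np.einsum("ij,ij->i", X[S.row], Y[:, S.col].T)
```

SDDMM never materializes the full dense product, which is the point: work scales with the number of nonzeros in the mask rather than with the full output size.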

Escoin: Efficient Sparse Convolutional Neural Network Inference on GPUs

This work proposes Escoin, an efficient sparse convolutional neural network inference approach on GPUs that orchestrates parallelism and locality for the direct sparse convolution kernel and applies customized optimization techniques to further improve performance.

Balanced Sparsity for Efficient DNN Inference on GPU

This paper proposes a novel fine-grained sparsity approach, Balanced Sparsity, that achieves high model accuracy efficiently on commercial hardware and adapts to the high-parallelism nature of GPUs, showing strong potential for sparsity in the wide deployment of deep learning services.

The State of Sparsity in Deep Neural Networks

It is shown that unstructured sparse architectures learned through pruning cannot be trained from scratch to the same test set performance as a model trained with joint sparsification and optimization, and the need for large-scale benchmarks in the field of model compression is highlighted.

Faster CNNs with Direct Sparse Convolutions and Guided Pruning

An efficient, general sparse-with-dense matrix multiplication implementation applicable to convolving feature maps with kernels of arbitrary sparsity patterns is developed, along with a performance model that predicts the sweet spots of sparsity levels for different layers on different computer architectures.
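One way to see how a pruned convolution reduces to sparse-with-dense matrix multiplication is the classic im2col lowering. The sketch below is an assumed simplification (the paper computes the convolution directly, without an explicit lowering step): patches of the input become columns of a dense matrix, and the pruned kernel becomes a sparse row vector multiplied against it.

```python
import numpy as np
from scipy.sparse import csr_matrix

def sparse_conv2d(x, w_sparse, k):
    """Valid 2-D correlation: x is (H, W), w_sparse is a CSR matrix of shape (1, k*k)."""
    H, W = x.shape
    out_h, out_w = H - k + 1, W - k + 1
    # im2col: every column holds one k*k patch of the input, flattened
    patches = np.empty((k * k, out_h * out_w))
    idx = 0
    for i in range(out_h):
        for j in range(out_w):
            patches[:, idx] = x[i:i + k, j:j + k].ravel()
            idx += 1
    # Sparse weight row times dense patch matrix -> all output pixels at once
    return np.asarray(w_sparse @ patches).reshape(out_h, out_w)

rng = np.random.default_rng(0)
x = rng.standard_normal((8, 8))
w = rng.standard_normal((3, 3))
w[np.abs(w) < 0.8] = 0.0  # prune small weights -> sparse kernel
y = sparse_conv2d(x, csr_matrix(w.reshape(1, -1)), 3)
```

With many output channels, each channel contributes one sparse row, so the whole layer becomes a single sparse-matrix times dense-matrix product over the shared patch matrix.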

Rigging the Lottery: Making All Tickets Winners

This paper introduces a method to train sparse neural networks with a fixed parameter count and a fixed computational cost throughout training, without sacrificing accuracy relative to existing dense-to-sparse training methods.

EIE: Efficient Inference Engine on Compressed Deep Neural Network

  • Song Han, Xingyu Liu, W. Dally
  • Computer Science
    2016 ACM/IEEE 43rd Annual International Symposium on Computer Architecture (ISCA)
  • 2016
An energy-efficient inference engine (EIE) performs inference directly on a compressed network model and accelerates the resulting sparse matrix–vector multiplication with weight sharing; it is 189x and 13x faster than CPU and GPU implementations of the same DNN without compression.
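The combination of pruning and weight sharing that EIE exploits can be sketched in plain Python. This is a simplified illustration, not EIE's hardware datapath: the pruned matrix is stored in CSR form, but each nonzero holds only a small index into a shared codebook of values, so a few bits per weight suffice.

```python
import numpy as np

# Compressed 3x4 weight matrix: CSR structure + codebook indices instead of values.
codebook = np.array([-0.5, 0.1, 0.7, 1.2])  # 2-bit shared weight values
indptr = np.array([0, 2, 3, 5])             # CSR row pointers (3 rows)
indices = np.array([0, 3, 1, 0, 2])         # column index of each nonzero
codes = np.array([2, 0, 3, 1, 1])           # codebook index per nonzero

def compressed_spmv(x):
    """y = W @ x, decoding each weight from the codebook on the fly."""
    y = np.zeros(len(indptr) - 1)
    for row in range(len(y)):
        for p in range(indptr[row], indptr[row + 1]):
            y[row] += codebook[codes[p]] * x[indices[p]]
    return y

x = np.array([1.0, 2.0, -1.0, 0.5])
y = compressed_spmv(x)
```

Because only nonzeros are stored and visited, both memory traffic and arithmetic scale with the number of surviving weights, which is where the reported speedups over uncompressed dense execution come from.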

High-Performance Deep Learning via a Single Building Block

The batch-reduce GEMM kernel is introduced, and it is shown how the most popular DL algorithms can be formulated with this kernel as the basic building block; high-performance CNN primitives are implemented on top of it.
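The contract of a batch-reduce GEMM is compact enough to state in code. The sketch below is an assumed simplification of the idea, not the optimized kernel: many small matrix products are accumulated into one output tile, C += sum_i A_i @ B_i, which lets convolutions and other layers reuse a single tuned inner kernel.

```python
import numpy as np

def batch_reduce_gemm(C, A_blocks, B_blocks):
    """Accumulate a list of small GEMMs into one output tile: C += sum_i A_i @ B_i."""
    for A_i, B_i in zip(A_blocks, B_blocks):
        C += A_i @ B_i  # a real kernel keeps C resident in registers throughout
    return C

rng = np.random.default_rng(0)
A_blocks = [rng.standard_normal((4, 8)) for _ in range(3)]
B_blocks = [rng.standard_normal((8, 5)) for _ in range(3)]
C = batch_reduce_gemm(np.zeros((4, 5)), A_blocks, B_blocks)
```

Keeping the accumulator tile hot across all the small products is the key design choice: it amortizes output reads and writes over the whole reduction instead of paying them per block.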

Neural Network Compression Framework for fast model inference

A new framework for neural network compression with fine-tuning is presented; it leverages recent advances in network compression and implements several methods, such as sparsity, quantization, and binarization, which can be applied to a wide range of models to accelerate inference while preserving the original accuracy.