• Corpus ID: 233296249

Accelerating Sparse Deep Neural Networks

@article{Mishra2021AcceleratingSD,
  title={Accelerating Sparse Deep Neural Networks},
  author={Asit K. Mishra and Jorge Albericio Latorre and Jeff Pool and Darko Stosic and Dusan Stosic and Ganesh Venkatesh and Chong Yu and Paulius Micikevicius},
  journal={ArXiv},
  year={2021},
  volume={abs/2104.08378}
}
As neural network model sizes have dramatically increased, so has the interest in various techniques to reduce their parameter counts and accelerate their execution. An active area of research in this field is sparsity – encouraging zero values in parameters that can then be discarded from storage or computations. While most research focuses on high levels of sparsity, there are challenges in universally maintaining model accuracy as well as achieving significant speedups over modern matrix… 
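The structured pattern at the center of this work, 2:4 sparsity, keeps at most two non-zero values in every group of four consecutive weights so the zeros can be skipped by sparse matrix-math hardware. As a rough illustration only, and not the authors' implementation, a minimal NumPy sketch of magnitude-based 2:4 pruning could look like this:

```python
import numpy as np

def prune_2_of_4(weights: np.ndarray) -> np.ndarray:
    """Zero the two smallest-magnitude values in every group of four
    consecutive weights along the last axis (2:4 structured sparsity).
    Assumes the last dimension is divisible by 4."""
    groups = weights.reshape(-1, 4)
    drop = np.argsort(np.abs(groups), axis=1)[:, :2]   # two smallest per group
    mask = np.ones_like(groups)
    np.put_along_axis(mask, drop, 0.0, axis=1)
    return (groups * mask).reshape(weights.shape)

dense = np.random.randn(2, 8)
sparse = prune_2_of_4(dense)
print(sparse)   # at most two non-zeros in each group of four
```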
Dual-side Sparse Tensor Core
TLDR
This work proposes a novel architecture to efficiently harness dual-side sparsity (i.e., weight and activation sparsity), built on a novel, previously unexplored paradigm that combines an outer-product computation primitive with a bitmap-based encoding format.
Two Sparsities Are Better Than One: Unlocking the Performance Benefits of Sparse-Sparse Networks
TLDR
Complementary Sparsity is introduced, a novel technique that significantly improves the performance of dual sparse networks on existing hardware; the results suggest that weight plus activation sparsity can be a potent combination for efficiently scaling future AI models.
SPDY: Accurate Pruning with Speedup Guarantees
TLDR
SPDY is a new compression method that automatically determines layer-wise sparsity targets achieving a desired inference speedup on a given system while minimizing accuracy loss, and is compatible with most existing pruning approaches.
1xN Pattern for Pruning Convolutional Neural Networks
TLDR
This paper proposes a novel 1×N pruning pattern in which N consecutive output kernels sharing the same input channel index are pruned into one block, which serves as the basic pruning granularity of the pattern.
Optimal Fine-Grained N:M Sparsity for Activations and Neural Gradients
TLDR
It is shown that optimal pruning of the neural gradients requires an unbiased minimum-variance pruning mask; such specialized masks are designed, and it is found that in most cases 1:2 sparsity is sufficient for training, and that 2:4 sparsity is usually enough when it is not.
AC/DC: Alternating Compressed/DeCompressed Training of Deep Neural Networks
TLDR
This paper presents a general approach called Alternating Compressed/DeCompressed (AC/DC) training of DNNs, demonstrates convergence for a variant of the algorithm, and shows that AC/DC outperforms existing sparse training methods in accuracy at similar computational budgets; at high sparsity levels, AC/DC even outperforms existing methods that rely on accurate pre-trained dense models.
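As a sketch of the alternating idea only (the model, data loader, loss function, and phase lengths below are placeholders, and this is not the authors' code), a compressed/decompressed training loop in PyTorch might be organized like this:

```python
import torch

def topk_mask(weight: torch.Tensor, sparsity: float) -> torch.Tensor:
    """Keep the largest-magnitude (1 - sparsity) fraction of entries."""
    k = max(1, int(weight.numel() * (1.0 - sparsity)))
    threshold = weight.abs().flatten().topk(k).values.min()
    return (weight.abs() >= threshold).float()

def train_ac_dc(model, loss_fn, loader, num_phases=6, steps_per_phase=1000, sparsity=0.9):
    """Alternate dense ('decompressed') and masked ('compressed') phases."""
    opt = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)
    for phase in range(num_phases):
        compressed = phase % 2 == 1          # even phases dense, odd phases sparse
        masks = ({n: topk_mask(p.data, sparsity)
                  for n, p in model.named_parameters() if p.dim() > 1}
                 if compressed else {})
        for step, (x, y) in enumerate(loader):
            if step >= steps_per_phase:
                break
            opt.zero_grad()
            loss_fn(model(x), y).backward()
            opt.step()
            if compressed:                   # keep pruned weights at zero
                with torch.no_grad():
                    for n, p in model.named_parameters():
                        if n in masks:
                            p.mul_(masks[n])
    return model
```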
Group Fisher Pruning for Practical Network Compression
TLDR
A general channel pruning approach that can be applied to various complicated structures; in particular, a layer grouping algorithm finds coupled channels automatically, and a unified metric based on Fisher information evaluates the importance of single and coupled channels.
Channel Permutations for N:M Sparsity
TLDR
An ablation study shows the importance of each part of the permutation search algorithm; experimental results show correlation between the quality metric and final network accuracy, improved sparse network accuracy using the proposed techniques with insignificant overhead to training time, and the transformation of unstructured into structured sparse workloads.
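The intuition is that permuting a weight matrix's input channels (its columns) before applying the 2:4 mask can preserve more of the large-magnitude weights. The toy brute-force search below, written under my own assumptions and far simpler than the paper's search algorithm, makes the effect visible on a tiny matrix:

```python
import itertools
import numpy as np

def magnitude_kept_2_of_4(w: np.ndarray) -> float:
    """Total |weight| surviving 2:4 pruning along groups of 4 columns."""
    groups = w.reshape(w.shape[0], -1, 4)
    kept = np.sort(np.abs(groups), axis=-1)[..., 2:]   # two largest per group
    return float(kept.sum())

rng = np.random.default_rng(0)
w = rng.normal(size=(4, 8))                # tiny example; brute force is only viable here

best = max(itertools.permutations(range(w.shape[1])),
           key=lambda p: magnitude_kept_2_of_4(w[:, list(p)]))
print("kept magnitude, identity:        ", magnitude_kept_2_of_4(w))
print("kept magnitude, best permutation:", magnitude_kept_2_of_4(w[:, list(best)]))
```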
Post-Training Sparsity-Aware Quantization
TLDR
This paper proposes a sparsity-aware quantization (SPARQ) method in which unstructured, dynamic activation sparsity is leveraged at different representation granularities, achieving minor accuracy degradation, a 2× speedup over widely used hardware architectures, and a practical hardware implementation.
NxMTransformer: Semi-Structured Sparsification for Natural Language Understanding via ADMM
TLDR
This work proposes to formulate the NxM sparsity as a constrained optimization problem and use Alternating Direction Method of Multipliers (ADMM) to optimize the downstream tasks while taking the underlying hardware constraints into consideration, generating sparsified Transformer networks that achieve high accuracy while being able to effectively execute on newly released hardware.
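ADMM-based sparsification of this kind typically alternates a gradient step on the task loss plus a quadratic penalty with a projection of an auxiliary variable onto the N:M-sparse set, followed by a dual update. The PyTorch sketch below shows that generic pattern only; the names (`project_n_of_m`, `admm_outer_step`, `task_loss_fn`) and hyperparameters are placeholders of mine, not the NxMTransformer implementation:

```python
import torch

def project_n_of_m(w: torch.Tensor, n: int = 2, m: int = 4) -> torch.Tensor:
    """Keep the n largest-magnitude entries in each group of m (numel must divide by m)."""
    groups = w.reshape(-1, m)
    idx = groups.abs().topk(n, dim=1).indices
    out = torch.zeros_like(groups).scatter(1, idx, groups.gather(1, idx))
    return out.reshape(w.shape)

def admm_outer_step(weight, task_loss_fn, z, u, rho=1e-3, lr=1e-2, inner_steps=100):
    """One ADMM iteration: W-update by gradient descent, Z-projection, dual update."""
    w = weight.detach().clone().requires_grad_(True)
    opt = torch.optim.SGD([w], lr=lr)
    for _ in range(inner_steps):
        opt.zero_grad()
        loss = task_loss_fn(w) + 0.5 * rho * (w - z + u).pow(2).sum()
        loss.backward()
        opt.step()
    z = project_n_of_m((w + u).detach())   # Z-update: projection onto the N:M set
    u = u + w.detach() - z                 # scaled dual update
    return w.detach(), z, u
```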

References

Showing 1-10 of 63 references
Sparse Tensor Core: Algorithm and Hardware Co-Design for Vector-wise Sparse Neural Networks on Modern GPUs
TLDR
A novel pruning algorithm is devised to improve workload balance and reduce the decoding overhead of sparse neural networks, and new instructions and micro-architecture optimizations are proposed for the Tensor Core to adapt to structurally sparse neural networks.
Structured Pruning of Deep Convolutional Neural Networks
TLDR
The proposed work shows that when pruning granularities are applied in combination, the CIFAR-10 network can be pruned by more than 70% with less than a 1% loss in accuracy.
Learning Structured Sparsity in Deep Neural Networks
TLDR
The results show that for CIFAR-10, regularization on layer depth can reduce a 20-layer Deep Residual Network to 18 layers while improving the accuracy from 91.25% to 92.60%, which is still slightly higher than that of the original 32-layer ResNet.
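Structured sparsity of this kind is typically induced by adding a group lasso term, the sum of the L2 norms of each structure (for example, each output channel), to the training loss so that whole structures are driven to zero together. A minimal PyTorch sketch of such a penalty, written under my own assumptions rather than taken from the paper, is shown below:

```python
import torch
import torch.nn as nn

def group_lasso_penalty(conv: nn.Conv2d) -> torch.Tensor:
    """Sum of L2 norms over the output-channel groups of a conv weight.
    Driving a group's norm to zero removes that whole output channel."""
    # weight shape: (out_channels, in_channels, kH, kW)
    per_channel = conv.weight.flatten(start_dim=1).norm(dim=1)
    return per_channel.sum()

conv = nn.Conv2d(16, 32, kernel_size=3)
x = torch.randn(8, 16, 28, 28)
loss = conv(x).pow(2).mean() + 1e-4 * group_lasso_penalty(conv)   # toy task loss + penalty
loss.backward()   # the penalty's gradient pushes whole channels toward zero
```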
Balanced Sparsity for Efficient DNN Inference on GPU
TLDR
This paper proposes a novel fine-grained sparsity approach, Balanced Sparsity, that efficiently achieves high model accuracy on commercial hardware and adapts to the high-parallelism properties of GPUs, showing great potential for sparsity in the wide deployment of deep learning services.
Block-Sparse Recurrent Neural Networks
TLDR
Two different approaches to induce block sparsity in RNNs are investigated: pruning blocks of weights in a layer, and using group lasso regularization with pruning to create blocks of weights with zeros; these can create block-sparse RNNs with sparsity ranging from 80% to 90% with a small loss in accuracy.
Accelerated Sparse Neural Training: A Provable and Efficient Method to Find N:M Transposable Masks
TLDR
A novel transposable fine-grained sparsity mask is suggested, which guarantees that both the weight matrix and its transpose follow the same sparsity pattern; thus, the matrix multiplication required for passing the error backward can also be accelerated.
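A mask is transposable in this sense when the 2:4 constraint holds along the rows of both the mask and its transpose, so the forward product with W and the backward product with W^T can both take the sparse path. The small NumPy check below is an illustrative sketch only, with dimensions assumed divisible by 4:

```python
import numpy as np

def satisfies_2_of_4(mask: np.ndarray) -> bool:
    """True if every group of 4 consecutive entries along the last axis
    has at most 2 non-zeros."""
    groups = mask.reshape(mask.shape[0], -1, 4)
    return bool((groups.sum(axis=-1) <= 2).all())

def is_transposable_2_of_4(mask: np.ndarray) -> bool:
    """Both the mask and its transpose must satisfy the 2:4 pattern."""
    return satisfies_2_of_4(mask) and satisfies_2_of_4(mask.T)

# A 4x4 block with exactly two ones in every row *and* every column.
block = np.array([[1, 1, 0, 0],
                  [0, 0, 1, 1],
                  [1, 0, 0, 1],
                  [0, 1, 1, 0]])
print(is_transposable_2_of_4(block))   # True
```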
Exploring the Granularity of Sparsity in Convolutional Neural Networks
  • Huizi Mao, Song Han, W. Dally
  • 2017 IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), 2017
TLDR
This analysis, based on the framework of a recent sparse convolutional neural network (SCNN) accelerator, demonstrates that coarse-grained sparsity saves 30-35% of memory references compared with fine-grained sparsity.
Exploring Sparsity in Recurrent Neural Networks
TLDR
This work proposes a technique to reduce the parameters of a network by pruning weights during its initial training, which reduces the size of the model and can also help achieve significant inference-time speedups using sparse matrix multiplication.
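Pruning during initial training is commonly driven by a sparsity schedule: at intervals, the smallest-magnitude weights are zeroed and the target sparsity ramps up as training progresses. The sketch below illustrates one plausible schedule and mask computation in PyTorch under my own assumptions; it is not the paper's exact recipe:

```python
import torch

def sparsity_at(step: int, total_steps: int, final_sparsity: float = 0.9) -> float:
    """Linear ramp from dense to final_sparsity over the first 80% of training."""
    return final_sparsity * min(1.0, step / (0.8 * total_steps))

def magnitude_mask(weight: torch.Tensor, sparsity: float) -> torch.Tensor:
    """Zero the smallest-magnitude `sparsity` fraction of weights."""
    k = int(weight.numel() * sparsity)
    if k == 0:
        return torch.ones_like(weight)
    threshold = weight.abs().flatten().kthvalue(k).values
    return (weight.abs() > threshold).float()

w = torch.randn(256, 256)
for step in (0, 2_000, 8_000):
    mask = magnitude_mask(w, sparsity_at(step, total_steps=10_000))
    print(step, float(mask.mean()))   # fraction of weights kept shrinks over training
```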
Mesh-TensorFlow: Deep Learning for Supercomputers
TLDR
Mesh-TensorFlow is introduced, a language for specifying a general class of distributed tensor computations, and is used to implement an efficient data-parallel, model-parallel version of the Transformer sequence-to-sequence model, surpassing state-of-the-art results on the WMT'14 English-to-French translation task and the one-billion-word language modeling benchmark.
Campfire: Compressible, Regularization-Free, Structured Sparse Training for Hardware Accelerators
TLDR
The proposed training methodology, Campfire, explores pruning at granularities within a convolutional kernel and filter; the results show that with a 70% sparsity target, over 75% top-1 accuracy is achievable.