• Corpus ID: 221082752

Inducing and Exploiting Activation Sparsity for Fast Neural Network Inference

Mark Kurtz, Justin Kopinsky, Rati Gelashvili, Alexander Matveev, John Carr, Michael Goin, William M. Leiserson, Bill Nell, Nir Shavit, Dan Alistarh
Optimizing deep neural networks for inference has recently become an extremely active area of research. One of the go-to solutions in this context is weight pruning, which aims to reduce computational and memory footprint by removing large subsets of the connections in a neural network. Surprisingly, much less attention has been given to exploiting sparsity in the activation maps, which tend to be naturally sparse in many settings thanks to the structure of rectified linear (ReLU) activation… 
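The observation that ReLU outputs are naturally sparse can be illustrated with a minimal sketch. It assumes random Gaussian pre-activations and a plain dense weight matrix; the helper names `activation_sparsity` and `matvec_skip_zeros` are hypothetical and are not taken from the paper, whose actual inference algorithms are not described in this excerpt.

```python
import numpy as np

def relu(x):
    """Rectified linear unit: negative entries become exactly zero."""
    return np.maximum(x, 0.0)

def activation_sparsity(a):
    """Fraction of entries in `a` that are exactly zero."""
    return float(np.mean(a == 0.0))

def matvec_skip_zeros(weights, activations):
    """Compute weights @ activations while touching only the columns
    that correspond to nonzero activations (illustrative only)."""
    nz = np.flatnonzero(activations)
    return weights[:, nz] @ activations[nz]

# Assumed toy data: zero-mean Gaussian pre-activations, so roughly
# half of the ReLU outputs are expected to be zero.
rng = np.random.default_rng(0)
pre_activations = rng.standard_normal(1000)
act = relu(pre_activations)
w = rng.standard_normal((16, 1000))

dense_result = w @ act
sparse_result = matvec_skip_zeros(w, act)
```

Skipping the zero activations gives the same product as the dense computation while reading fewer weight columns, which is the general kind of saving that activation sparsity makes possible.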


Training for temporal sparsity in deep neural networks, application in video processing

A new DNN layer, called the Delta Activation Layer, is introduced whose sole purpose is to promote temporal sparsity of activations during training. It is implemented as an extension of the standard TensorFlow-Keras library and applied to train deep neural networks on the Human Action Recognition dataset.

Sparse Weight Activation Training

Sparse Weight Activation Training (SWAT), an algorithm that embodies these observations, is proposed; it reduces computations by 50% to 80% with better accuracy at a given level of sparsity than the Dynamic Sparse Graph algorithm.

Locally Sparse Neural Networks for Tabular Biomedical Data

This work designs a locally sparse neural network where the local sparsity is learned to identify the subset of most relevant features for each sample, and reduces model overfitting in low-sample-size data and obtains an interpretable model.

Neural Decoding With Optimization of Node Activations

It is shown that the neural decoder can be improved with two novel loss terms on the node's activations; the resulting decoder has the same run time complexity and model size as the neural Belief Propagation decoder, while improving the decoding performance by up to 1.1 dB on BCH codes.

Improved Projection Learning for Lower Dimensional Feature Maps

This work explores an improved method for compressing all feature maps of pre-trained CNNs to below a specified limit by means of learned projections trained via end-to-end finetuning, which can then be folded and fused into the pre-trained network.

Implicit Regularization of SGD via Thermophoresis

There exists an effective entropic force from SGD that pushes to reduce the gradient variance; this effect is proportional to the squared learning rate and the inverse batch size, and is more effective during the early phase of training, when the model's predictions are poor.



DeepHoyer: Learning Sparser Neural Network with Differentiable Scale-Invariant Sparsity Measures

DeepHoyer is presented, a set of sparsity-inducing regularizers that are both differentiable almost everywhere and scale-invariant, and can be applied to both element-wise and structural pruning.
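As a minimal sketch of what a scale-invariant sparsity measure looks like, the function below computes the squared ratio of the L1 norm to the L2 norm of a weight vector. This is assumed here to correspond to the element-wise regularizer described in the summary; the function name `hoyer_square` and the `eps` stabilizer are illustrative choices, not details confirmed by this page.

```python
import numpy as np

def hoyer_square(w, eps=1e-12):
    """Squared L1/L2 ratio: (sum |w_i|)^2 / sum w_i^2.

    The value is smallest (about 1) when a single entry is nonzero and
    largest (about n) when all n entries have equal magnitude.
    `eps` is an assumed stabilizer to avoid division by zero.
    """
    w = np.asarray(w, dtype=float)
    l1 = np.sum(np.abs(w))
    l2_squared = np.sum(w * w)
    return float(l1 * l1 / (l2_squared + eps))

one_hot = hoyer_square([0.0, 0.0, 5.0, 0.0])   # sparse vector
uniform = hoyer_square([1.0, 1.0, 1.0, 1.0])   # dense vector
```

Because both the numerator and the denominator scale with the square of the weights, multiplying `w` by a nonzero constant leaves the measure essentially unchanged, which is the scale invariance the summary refers to.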

Exploiting the input sparsity to accelerate deep neural networks: poster

This paper proposes an end-to-end optimization pipeline that generates programs for inference with sparse input; the pipeline combines domain-specific and general optimization techniques and is capable of generating efficient code without relying on off-the-shelf libraries.

The State of Sparsity in Deep Neural Networks

It is shown that unstructured sparse architectures learned through pruning cannot be trained from scratch to the same test set performance as a model trained with joint sparsification and optimization, and the need for large-scale benchmarks in the field of model compression is highlighted.

Accelerating Convolutional Neural Networks via Activation Map Compression

Georgios Georgiadis. 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2019.
A three-stage compression and acceleration pipeline that sparsifies, quantizes and entropy encodes activation maps of Convolutional Neural Networks is proposed, leading to both acceleration of inference and higher model accuracy.

Faster CNNs with Direct Sparse Convolutions and Guided Pruning

An efficient general sparse-with-dense matrix multiplication implementation that is applicable to convolution of feature maps with kernels of arbitrary sparsity patterns and a performance model that predicts sweet spots of sparsity levels for different layers and on different computer architectures are developed.

To prune, or not to prune: exploring the efficacy of pruning for model compression

Across a broad range of neural network architectures, large-sparse models are found to consistently outperform small-dense models and achieve up to 10x reduction in number of non-zero parameters with minimal loss in accuracy.

Pruning Filters for Efficient ConvNets

This work presents an acceleration method for CNNs, where it is shown that even simple filter pruning techniques can reduce inference costs for VGG-16 and ResNet-110 by up to 38% on CIFAR10 while regaining close to the original accuracy by retraining the networks.
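One simple filter pruning technique can be sketched as follows. The sketch assumes filters are ranked by the sum of their absolute weights and that the lowest-ranked filters are dropped; the summary above does not state the ranking criterion, so this choice, the function name `prune_filters_by_l1`, and the `keep_ratio` parameter are all assumptions made for illustration. Retraining after pruning is not shown.

```python
import numpy as np

def prune_filters_by_l1(conv_weight, keep_ratio):
    """Keep the filters with the largest absolute-weight sums.

    conv_weight: array of shape (out_channels, in_channels, kh, kw).
    keep_ratio:  fraction of output filters to keep (assumed parameter).
    Returns the pruned weight tensor and the indices of kept filters.
    """
    n_out = conv_weight.shape[0]
    n_keep = max(1, int(round(n_out * keep_ratio)))
    # One score per output filter: sum of absolute weights.
    scores = np.abs(conv_weight).reshape(n_out, -1).sum(axis=1)
    # Take the highest-scoring filters, then restore original order.
    kept = np.sort(np.argsort(scores)[::-1][:n_keep])
    return conv_weight[kept], kept

# Toy layer with four filters of clearly different magnitudes.
weights = np.zeros((4, 2, 3, 3))
weights[0] = 0.1
weights[1] = 1.0
weights[2] = 0.01
weights[3] = 0.5
pruned, kept_idx = prune_filters_by_l1(weights, keep_ratio=0.5)
```

Removing whole filters shrinks the layer's output channels, so the next layer's input channels would need to be sliced with the same `kept_idx` for the network to remain consistent.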

Learning Activation Functions to Improve Deep Neural Networks

A novel form of piecewise linear activation function that is learned independently for each neuron using gradient descent is designed, achieving state-of-the-art performance on CIFAR-10, CIFAR-100, and a benchmark from high-energy physics involving Higgs boson decay modes.

Learning both Weights and Connections for Efficient Neural Network

A method is described that reduces the storage and computation required by neural networks by an order of magnitude without affecting their accuracy; it learns only the important connections and prunes redundant connections using a three-step method.

WRPN: Wide Reduced-Precision Networks

This work reduces the precision of activation maps (along with model parameters) and increases the number of filter maps in a layer, and finds that this scheme matches or surpasses the accuracy of the baseline full-precision network.