• Corpus ID: 27494814

To prune, or not to prune: exploring the efficacy of pruning for model compression

Michael Zhu, Suyog Gupta
Model pruning seeks to induce sparsity in a deep neural network's various connection matrices, thereby reducing the number of nonzero-valued parameters in the model. […] We investigate these two distinct paths for model compression within the context of energy-efficient inference in resource-constrained environments and propose a new gradual pruning technique that is simple and straightforward to apply across a variety of models/datasets with minimal tuning and can be seamlessly incorporated within…
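As a rough illustration of what inducing sparsity in a connection matrix means, the sketch below zeroes out the smallest-magnitude entries of a weight matrix until a target fraction of them is zero. The magnitude criterion, the function names `magnitude_prune` and `count_nonzero`, and the example weights are assumptions made for illustration; the abstract above does not specify the selection criterion, and this is not presented as the paper's gradual pruning technique.

```python
def magnitude_prune(matrix, sparsity):
    """Return a copy of `matrix` with the smallest-magnitude entries set to 0.0.

    `matrix` is a rectangular list of lists of floats (hypothetical stand-in
    for a layer's weight matrix); `sparsity` is the fraction of entries to zero.
    """
    if not 0.0 <= sparsity <= 1.0:
        raise ValueError("sparsity must be in [0, 1]")

    # Flatten to magnitudes so all entries can be ranked together.
    magnitudes = [abs(w) for row in matrix for w in row]
    num_to_prune = int(sparsity * len(magnitudes))

    # Flat indices of the entries with the smallest magnitudes.
    order = sorted(range(len(magnitudes)), key=lambda i: magnitudes[i])
    pruned = set(order[:num_to_prune])

    n_cols = len(matrix[0]) if matrix else 0
    return [
        [0.0 if (r * n_cols + c) in pruned else w for c, w in enumerate(row)]
        for r, row in enumerate(matrix)
    ]


def count_nonzero(matrix):
    """Count the nonzero-valued parameters in a matrix."""
    return sum(1 for row in matrix for w in row if w != 0.0)


# Illustrative weights, not taken from any model in the paper.
weights = [[0.9, -0.05, 0.3], [-0.7, 0.01, 0.2]]
pruned = magnitude_prune(weights, 0.5)
print(pruned)
print(count_nonzero(weights), "->", count_nonzero(pruned))
```

At 50% sparsity, the three entries closest to zero are removed and the three largest-magnitude weights survive. In practice the surviving weights are usually trained further to recover accuracy.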

Rethinking the Value of Network Pruning

It is found that with optimal learning rate, the "winning ticket" initialization as used in Frankle & Carbin (2019) does not bring improvement over random initialization, and the need for more careful baseline evaluations in future research on structured pruning methods is suggested.

Fine-tuning Pruned Networks with Linear Over-parameterization

A novel method is proposed that linearly over-parameterizes the compact layers in pruned networks to enlarge the number of fine-tuning parameters and then re-parameterizes them to the original layers after fine-tuning.

PruNet: Class-Blind Pruning Method For Deep Neural Networks

It is demonstrated that retraining after pruning is essential to restore the accuracy of the network, and that pruning can be complemented by other compression techniques, such as weight sharing, quantization, or fixed-point conversion, which further reduce memory and computation.

What to Prune and What Not to Prune at Initialization

Two approaches to prune at initialization are presented to achieve higher sparsity while preserving performance and the efficacy of the said methods on Autoencoders and Fully Connected Multilayered Perceptrons is evaluated.

A Winning Hand: Compressing Deep Networks Can Improve Out-Of-Distribution Robustness

A large-scale analysis of popular model compression techniques which uncovers several intriguing patterns and shows the compatibility of CARDs with popular existing strategies, such as data augmentation and model size increase, and proposes a new robustness-improvement strategy that leverages the compactness of CARDs via ensembling.

Effective Model Sparsification by Scheduled Grow-and-Prune Methods

A novel scheduled grow-and-prune (GaP) methodology without having to pre-train a dense model is proposed, which addresses the shortcomings of the previous work by repeatedly growing a subset of layers to dense and then pruning them back to sparse after some training.

Stochastic Model Pruning via Weight Dropping Away and Back

The Drop Pruning approach, which leverages stochastic optimization in the pruning process by introducing a drop strategy at each pruning step, can achieve competitive compression performance and accuracy on many benchmark tasks compared with state-of-the-art weights pruning and Bayesian training approaches.

Characterising Bias in Compressed Models

This work proposes its use as a human-in-the-loop auditing tool to surface a tractable subset of the dataset for further inspection or annotation by a domain expert and establishes that for CIE examples, compression amplifies existing algorithmic bias.

Back to Basics: Efficient Network Compression via IMP

It is found that basic IMP with SLR for retraining can outperform state-of-the-art pruning-during-training approaches with little or no computational overhead, that the global magnitude selection criterion is largely competitive with more complex approaches, and that only a few retraining epochs are needed in practice to achieve most of the sparsity-vs-performance tradeoff of IMP.

Robustness to Pruning Predicts Generalization in Deep Neural Networks

Prunability is introduced: the smallest fraction of the network's parameters that can be kept while pruning without adversely affecting its training loss, which is similar to, but more predictive than, existing flatness-based measures.



Compression of Neural Machine Translation Models via Pruning

It is shown that an NMT model with over 200 million parameters can be pruned by 40% with very little performance loss as measured on the WMT'14 English-German translation task.

The Power of Sparsity in Convolutional Neural Networks

2D convolution is generalized to use a channel-wise sparse connection structure and it is shown that this leads to significantly better results than the baseline approach for large networks including VGG and Inception V3.

Pruning Filters for Efficient ConvNets

This work presents an acceleration method for CNNs, where it is shown that even simple filter pruning techniques can reduce inference costs for VGG-16 and ResNet-110 by up to 38% on CIFAR10 while regaining close to the original accuracy by retraining the networks.

Exploring Sparsity in Recurrent Neural Networks

This work proposes a technique to reduce the parameters of a network by pruning weights during the initial training of the network, which reduces the size of the model and can also help achieve significant inference time speed-up using sparse matrix multiply.

Structured Pruning of Deep Convolutional Neural Networks

The proposed work shows that when pruning granularities are applied in combination, the CIFAR-10 network can be pruned by more than 70% with less than a 1% loss in accuracy.

Deep Compression: Compressing Deep Neural Network with Pruning, Trained Quantization and Huffman Coding

This work introduces "deep compression", a three-stage pipeline of pruning, trained quantization, and Huffman coding that work together to reduce the storage requirement of neural networks by 35x to 49x without affecting their accuracy.

Learning both Weights and Connections for Efficient Neural Network

A method is presented that reduces the storage and computation required by neural networks by an order of magnitude without affecting their accuracy by learning only the important connections, pruning redundant connections using a three-step method.

Rethinking the Inception Architecture for Computer Vision

This work explores ways to scale up networks that aim to use the added computation as efficiently as possible through suitably factorized convolutions and aggressive regularization.

Learning the Number of Neurons in Deep Networks

This paper proposes to make use of a group sparsity regularizer on the parameters of the network, where each group is defined to act on a single neuron, and shows that this approach can reduce the number of parameters by up to 80% while retaining or even improving the network accuracy.

Cambricon-X: An accelerator for sparse neural networks

A novel accelerator is proposed, Cambricon-X, to exploit the sparsity and irregularity of NN models for increased efficiency and experimental results show that this accelerator achieves, on average, 7.23x speedup and 6.43x energy saving against the state-of-the-art NN accelerator.