Pruning and Quantization for Deep Neural Network Acceleration: A Survey

Tailin Liang, C. John Glossner, Lei Wang, Shaobo Shi
Abstract: Deep neural networks have been applied in many applications, exhibiting extraordinary abilities in the field of computer vision. However, complex network architectures challenge efficient real-time deployment and require significant computation resources and energy costs. These challenges can be overcome through optimizations such as network compression. Network compression can often be realized with little loss of accuracy. In some cases accuracy may even improve. This paper provides a…
Zero-Keep Filter Pruning for Energy/Power Efficient Deep Neural Networks
This work proposes a method that maximizes the number of zero elements in filters by replacing small values with zero and pruning the filter that has the lowest number of zeros, and proves that this method performs better, with many fewer non-zero elements, at a marginal drop in accuracy.
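The two steps named in this summary (zeroing small values, then pruning the filter with the fewest zeros) can be sketched as follows. This is a minimal illustration and not the authors' implementation: the flattened-list filter representation, the `threshold` parameter, the function names, and pruning exactly one filter per call are all assumptions made for the example.

```python
from typing import List

Filter = List[float]  # one filter flattened to a list of weights (assumption)


def zero_small_values(filters: List[Filter], threshold: float) -> List[Filter]:
    """Replace every weight whose magnitude is below threshold with 0.0."""
    return [[0.0 if abs(w) < threshold else w for w in f] for f in filters]


def count_zeros(f: Filter) -> int:
    """Number of exactly-zero weights in a filter."""
    return sum(1 for w in f if w == 0.0)


def prune_fewest_zeros(filters: List[Filter]) -> List[Filter]:
    """Drop the single filter with the fewest zeros (first one on ties)."""
    if not filters:
        return []
    drop = min(range(len(filters)), key=lambda i: count_zeros(filters[i]))
    return [f for i, f in enumerate(filters) if i != drop]


# Hypothetical 3-filter layer; the values are made up for illustration.
filters = [
    [0.90, -0.02, 0.01, 0.50],
    [0.03, 0.01, -0.04, 0.70],
    [0.60, -0.80, 0.30, 0.02],
]
sparse = zero_small_values(filters, threshold=0.05)
kept = prune_fewest_zeros(sparse)
print(kept)
```

In this toy input the third filter ends up with only one zero after thresholding, so it is the one removed; the surviving filters are the ones that are mostly zeros.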
A Survey of Quantization Methods for Efficient Neural Network Inference
This article surveys approaches to the problem of quantizing the numerical values in deep neural network computations, covering the advantages and disadvantages of current methods.
An Evaluation of Model Compression & Optimization Combinations
With time, machine learning models have increased in their scope, functionality and size. Consequently, the increased functionality and size of such models require high-end hardware to both train…
Design Space Exploration of Sparse Accelerators for Deep Neural Networks
This paper systematically examines overheads of supporting sparsity on top of an optimized dense core and introduces novel techniques to reuse resources of the same core to maintain high performance and efficiency when running single sparsity or dense models.
Experiments on Properties of Hidden Structures of Sparse Neural Networks
Insight is provided into experiments in which sparsity is achieved through prior initialization, through pruning, and during learning, and questions on the relationship between the structure of neural networks and their performance are answered.
Juvenile state hypothesis: What we can learn from lottery ticket hypothesis researches?
  • Di Zhang
  • Computer Science
  • ArXiv
  • 2021
A strategy is proposed that combines the idea of neural network structure search with a pruning algorithm to alleviate the difficulty of training or performance degradation of the sub-networks after pruning and the forgetting of the weights of the original lottery ticket hypothesis.
Proof of concept of a fast surrogate model of the VMEC code via neural networks in Wendelstein 7-X scenarios
In magnetic confinement fusion research, the achievement of high plasma pressure is key to reaching the goal of net energy production. The magnetohydrodynamic (MHD) model is used to self-consistently…
Queen Jane Approximately: Enabling Efficient Neural Network Inference with Context-Adaptivity
A context-aware method for dynamically adjusting the width of an on-device neural network based on the input and context-dependent classification confidence is developed, and it is demonstrated that such a network would save up to 37.8% energy and induce only 1% loss of accuracy if used for continuous activity monitoring in the field of elderly care.
Realization of Neural Network-based Optical Channel Equalizer in Restricted Hardware
We quantify the achievable reduction of the processing complexity of artificial neural network-based equalizers in a coherent optical channel using the pruning and quantization techniques. First, we…
Shifting Capsule Networks from the Cloud to the Deep Edge
The software kernels extend the Arm CMSIS-NN and RISC-V PULP-NN to support capsule operations with 8-bit integers as operands, and a framework to perform post-training quantization of a CapsNet is proposed.


Integer Quantization for Deep Learning Inference: Principles and Empirical Evaluation
This paper presents a workflow for 8-bit quantization that is able to maintain accuracy within 1% of the floating-point baseline on all networks studied, including models that are more difficult to quantize, such as MobileNets and BERT-large.
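For readers unfamiliar with what 8-bit quantization means in practice, the sketch below shows one common generic scheme: symmetric per-tensor quantization with max calibration. It is an illustration under stated assumptions, not the workflow from the paper summarized above; the function names and the choice of the range [-127, 127] are assumptions made for the example.

```python
from typing import List, Tuple


def quantize_int8(values: List[float]) -> Tuple[List[int], float]:
    """Symmetric per-tensor quantization to integers in [-127, 127].

    The scale is chosen so that the largest magnitude maps to 127
    (max calibration, an assumption of this sketch).
    """
    max_abs = max((abs(v) for v in values), default=0.0)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    q = [max(-127, min(127, round(v / scale))) for v in values]
    return q, scale


def dequantize(q: List[int], scale: float) -> List[float]:
    """Map the integers back to approximate real values."""
    return [x * scale for x in q]


# Hypothetical weight tensor, flattened.
weights = [0.5, -1.27, 0.0, 1.0]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
print(q, scale)
print(restored)
```

Each restored value differs from the original by at most half a quantization step (`scale / 2`), which is the rounding error that calibration and fine-tuning techniques try to keep from affecting accuracy.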
A Survey of Model Compression and Acceleration for Deep Neural Networks
This paper surveys recently developed techniques for compacting and accelerating CNN models, roughly categorized into four schemes: parameter pruning and sharing, low-rank factorization, transferred/compact convolutional filters, and knowledge distillation.
Recent advances in efficient computation of deep convolutional neural networks
A comprehensive survey of recent advances in network acceleration, compression, and accelerator design from both algorithm and hardware points of view is provided.
Compression of convolutional neural networks: A short survey
The state of the art in CNN compression is reviewed, and it is concluded that future CNN compression algorithms should be co-designed with the hardware that will process deep learning algorithms.
Extremely Low Bit Neural Network: Squeeze the Last Bit Out with ADMM
This paper focuses on compressing and accelerating deep models with network weights represented by very small numbers of bits, referred to as extremely low bit neural networks, and proposes to solve this problem using extragradient and iterative quantization algorithms that lead to considerably faster convergence compared to conventional optimization methods.
Post training 4-bit quantization of convolutional networks for rapid-deployment
This paper introduces the first practical 4-bit post-training quantization approach: it does not involve training the quantized model (fine-tuning), nor does it require the availability of the full dataset, and it achieves accuracy that is just a few percent below the state-of-the-art baseline across a wide range of convolutional models.
Accelerating Convolutional Neural Networks with Dynamic Channel Pruning
Compared with baseline methods (Filter Pruning, Channel Pruning and RNP) in terms of increase in error, DCP consistently outperforms the other methods as the speed-up ratio increases, and the experiments show that DCP also consistently outperforms the baseline model on CIFAR-10 and CIFAR-100.
Deep Compression: Compressing Deep Neural Network with Pruning, Trained Quantization and Huffman Coding
This work introduces "deep compression", a three-stage pipeline of pruning, trained quantization and Huffman coding, whose stages work together to reduce the storage requirement of neural networks by 35x to 49x without affecting their accuracy.
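The shape of such a three-stage pipeline can be illustrated with a toy example. This is a sketch under loud assumptions and not the method from the paper: the quantization stage here uses simple uniform binning in place of trained quantization, no retraining is performed, and the threshold, bin width, and function names are hypothetical.

```python
import heapq
from collections import Counter
from typing import Dict, List


def prune(weights: List[float], threshold: float) -> List[float]:
    """Stage 1: magnitude pruning, zeroing weights below the threshold."""
    return [0.0 if abs(w) < threshold else w for w in weights]


def share_weights(weights: List[float], step: float) -> List[int]:
    """Stage 2 (stand-in): map each weight to a uniform bin index.

    Uniform binning replaces trained quantization in this sketch.
    """
    return [round(w / step) for w in weights]


def huffman_code_lengths(symbols: List[int]) -> Dict[int, int]:
    """Stage 3: Huffman code length in bits for each distinct symbol."""
    counts = Counter(symbols)
    if not counts:
        return {}
    if len(counts) == 1:
        return {next(iter(counts)): 1}
    # Heap entries: (count, unique tie-breaker, {symbol: depth so far}).
    heap = [(c, i, {s: 0}) for i, (s, c) in enumerate(counts.items())]
    heapq.heapify(heap)
    tie = len(heap)
    while len(heap) > 1:
        c1, _, d1 = heapq.heappop(heap)
        c2, _, d2 = heapq.heappop(heap)
        merged = {s: depth + 1 for s, depth in {**d1, **d2}.items()}
        heapq.heappush(heap, (c1 + c2, tie, merged))
        tie += 1
    return heap[0][2]


# Hypothetical flattened weights.
weights = [0.01, -0.02, 0.5, 0.52, -0.49, 0.03, 1.0, 0.0]
symbols = share_weights(prune(weights, threshold=0.1), step=0.5)
lengths = huffman_code_lengths(symbols)
total_bits = sum(lengths[s] for s in symbols)
print(symbols, lengths, total_bits)
```

Because pruning makes the zero bin by far the most frequent symbol, Huffman coding assigns it the shortest code, which is why the stages compound: in this toy input the encoded indices need fewer bits than a fixed 2-bit code per weight would.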
Binary Neural Networks: A Survey
A comprehensive survey of algorithms proposed for binary neural networks is presented, mainly categorized into the native solutions directly conducting binarization, and the optimized ones using techniques like minimizing the quantization error, improving the network loss function, and reducing the gradient error.
Efficient Processing of Deep Neural Networks: A Tutorial and Survey
Deep neural networks (DNNs) are currently widely used for many artificial intelligence (AI) applications including computer vision, speech recognition, and robotics. While DNNs deliver…