Corpus ID: 44138210

MPDCompress - Matrix Permutation Decomposition Algorithm for Deep Neural Network Compression

Lazar Supic, Rawan Naous, Ranko Sredojevic, Aleksandra Faust, Vladimir M. Stojanović
Deep neural networks (DNNs) have become the state-of-the-art technique for machine learning tasks in various applications. However, due to their size and computational complexity, large DNNs are not readily deployable on edge devices in real time. To manage complexity and accelerate computation, network compression techniques based on pruning and quantization have been proposed and shown to be effective in reducing network size. However, such network compression can result in irregular… 
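For context, the pruning and quantization the abstract refers to can be sketched in a few lines of NumPy. This is a generic magnitude-pruning plus uniform-quantization illustration, not the MPDCompress algorithm itself; the `sparsity` and `bits` parameters are illustrative assumptions.

```python
import numpy as np

def prune_and_quantize(w, sparsity=0.9, bits=8):
    """Zero out the smallest-magnitude weights, then uniformly quantize
    the survivors. Generic illustration, not MPDCompress."""
    flat = np.abs(w).ravel()
    k = int(sparsity * flat.size)
    threshold = np.partition(flat, k)[k] if k < flat.size else np.inf
    mask = np.abs(w) >= threshold                 # keep the largest weights
    scale = np.abs(w[mask]).max() / (2 ** (bits - 1) - 1) if mask.any() else 1.0
    q = np.round(w / scale) * scale               # uniform symmetric quantization
    return q * mask, mask

rng = np.random.default_rng(0)
w = rng.standard_normal((64, 64))
w_c, mask = prune_and_quantize(w, sparsity=0.9, bits=8)
print(f"nonzero fraction: {mask.mean():.2f}")    # → nonzero fraction: 0.10
```

The resulting matrix is mostly zeros with coarsely quantized survivors, which is exactly the kind of irregular sparsity pattern the abstract's last sentence alludes to.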


On Minimizing Diagonal Block-Wise Differences for Neural Network Compression
A new algorithm, Memory-Efficient and Structure-Aware Compression (MESA), is proposed, which effectively prunes the weights into a block-diagonal structure to significantly boost the compression rate.
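The block-diagonal structure MESA targets can be illustrated with a simple mask. This is a sketch assuming equal-sized square blocks; it shows the target sparsity pattern, not the MESA pruning procedure.

```python
import numpy as np

def block_diagonal_mask(n, num_blocks):
    """Boolean mask that keeps only weights inside square diagonal blocks."""
    assert n % num_blocks == 0
    b = n // num_blocks
    mask = np.zeros((n, n), dtype=bool)
    for i in range(num_blocks):
        mask[i * b:(i + 1) * b, i * b:(i + 1) * b] = True
    return mask

mask = block_diagonal_mask(8, 4)
print(f"density: {mask.mean():.2f}")   # → density: 0.25 (four 2x2 blocks in 8x8)
```

Only the diagonal blocks need to be stored and multiplied, which is why this structure compresses well and maps cleanly onto hardware.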
Tuning Algorithms and Generators for Efficient Edge Inference
A cross-layer software-hardware design framework is created that encompasses network training and model compression aware of, and tuned to, the underlying hardware architecture, yielding a converged network that can be partitioned and efficiently scheduled on the target hardware platform, minimizing data movement and improving overall throughput and energy.
Edge Inference with NEM Relays
  • R. Naous, V. Stojanović
  • Computer Science
    2019 IEEE SOI-3D-Subthreshold Microelectronics Technology Unified Conference (S3S)
  • 2019
This work leverages the inherent sparsity of neural networks and the static nature of the weight memory in inference mode to introduce a NEM-relay-based neural accelerator engine, which achieves 6x better energy efficiency and 3x lower area than conventional CMOS designs.


Structured Deep Neural Network Pruning via Matrix Pivoting
This work introduces pruning via matrix pivoting as a way to improve network pruning, compromising between the design flexibility of architecture-oblivious pruning and the performance efficiency of architecture-aware pruning, the two dominant techniques for obtaining resource-efficient DNNs.
Scalpel: Customizing DNN pruning to the underlying hardware parallelism
This work implemented weight pruning for several popular networks on a variety of hardware platforms and observed surprising results, including mean speedups of 3.54x, 2.61x, and 1.25x while reducing model sizes by 88%, 82%, and 53%.
Compression of Deep Convolutional Neural Networks for Fast and Low Power Mobile Applications
A simple and effective scheme to compress the entire CNN, called one-shot whole-network compression, which addresses the important implementation-level issue of 1×1 convolution, a key operation in the inception module of GoogLeNet as well as in CNNs compressed by the proposed scheme.
Learning Structured Sparsity in Deep Neural Networks
The results show that for CIFAR-10, regularization on layer depth can reduce a Deep Residual Network from 20 layers to 18 layers while improving the accuracy from 91.25% to 92.60%, which is still slightly higher than that of the original 32-layer ResNet.
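The structured-sparsity result above comes from a group-lasso-style regularizer, which penalizes the L2 norm of whole weight groups so that entire rows, columns, filters, or layers are driven to zero together. A minimal sketch, with columns as the assumed grouping:

```python
import numpy as np

def group_lasso_penalty(w, axis=0):
    """Sum of L2 norms over groups (here: columns). Unlike plain L1,
    this drives whole groups to exactly zero, yielding structured sparsity."""
    return np.sqrt((w ** 2).sum(axis=axis)).sum()

w = np.array([[3.0, 0.0],
              [4.0, 0.0]])
print(group_lasso_penalty(w))  # → 5.0 (one active column of norm 5, one zero column)
```

Adding this penalty to the training loss is what makes the learned sparsity hardware-friendly: the zeros arrive in removable blocks rather than scattered entries.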
Soft Weight-Sharing for Neural Network Compression
This paper shows that competitive compression rates can be achieved by using a version of “soft weight-sharing” (Nowlan & Hinton, 1992), and achieves both quantization and pruning in one simple (re-)training procedure, exposing the relation between compression and the minimum description length (MDL) principle.
Trained Ternary Quantization
This work proposes Trained Ternary Quantization (TTQ), a method that reduces the precision of weights in neural networks to ternary values while improving the accuracy of some models (32-, 44-, and 56-layer ResNets) on CIFAR-10 and of AlexNet on ImageNet.
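The ternary mapping can be sketched as follows. This is a simplification of TTQ: it thresholds by a fraction of the maximum weight magnitude and uses the mean magnitude of each side as the scale, whereas in the paper the two scaling factors are themselves trained by backpropagation.

```python
import numpy as np

def ternarize(w, t=0.05):
    """Map weights to {-wn, 0, +wp}: zero out |w| <= t * max|w|, then scale
    each side by its mean magnitude (TTQ trains these scales instead)."""
    delta = t * np.abs(w).max()
    pos, neg = w > delta, w < -delta
    wp = w[pos].mean() if pos.any() else 0.0
    wn = -w[neg].mean() if neg.any() else 0.0
    return wp * pos.astype(w.dtype) - wn * neg.astype(w.dtype)

rng = np.random.default_rng(1)
w = rng.standard_normal((4, 4))
print(np.unique(ternarize(w)))  # at most three distinct values
```

With only three distinct values per layer, each weight needs just 2 bits plus two per-layer scales, which is where the compression comes from.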
Scalable and Sustainable Deep Learning via Randomized Hashing
This work presents a novel hashing-based technique to drastically reduce the amount of computation needed to train and test neural networks, and demonstrates the scalability and sustainability (energy efficiency) of the proposed algorithm via rigorous experimental evaluations on several datasets.
SEP-Nets: Small and Effective Pattern Networks
This paper proposes a simple yet powerful method for compressing the size of deep CNNs based on parameter binarization, and proposes a new block structure codenamed the pattern residual block, which adds transformed feature maps generated by convolutional neural networks to the pattern feature maps generated by convolutions, based on which a small network with ~1 million parameters is designed.
Quantized Neural Networks: Training Neural Networks with Low Precision Weights and Activations
A binary matrix multiplication GPU kernel is programmed, with which the MNIST QNN runs 7 times faster than with an unoptimized GPU kernel, without suffering any loss in classification accuracy.
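The arithmetic trick behind such binary kernels is that a dot product of two {-1, +1} vectors reduces to XNOR plus popcount: dot(a, b) = n - 2 * popcount(bits(a) XOR bits(b)). A NumPy sketch of that identity (illustrative only; a real GPU kernel packs bits into machine words):

```python
import numpy as np

def binary_matmul(A, B):
    """Multiply two {-1, +1} matrices using the identity
    dot(a, b) = n - 2 * popcount(a_bits XOR b_bits)."""
    n = A.shape[1]
    a_bits = (A > 0)                  # encode +1 as True, -1 as False
    b_bits = (B > 0)
    out = np.empty((A.shape[0], B.shape[1]), dtype=np.int64)
    for i in range(A.shape[0]):
        for j in range(B.shape[1]):
            mismatches = np.count_nonzero(a_bits[i] ^ b_bits[:, j])
            out[i, j] = n - 2 * mismatches
    return out

rng = np.random.default_rng(2)
A = rng.choice([-1, 1], size=(3, 8))
B = rng.choice([-1, 1], size=(8, 5))
print(np.array_equal(binary_matmul(A, B), A @ B))  # → True
```

Each dot product costs one XOR and one popcount per machine word instead of n multiply-accumulates, which is where the reported speedup originates.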
Compressing Neural Networks with the Hashing Trick
This work presents a novel network architecture, HashedNets, that exploits inherent redundancy in neural networks to achieve drastic reductions in model size, and demonstrates on several benchmark data sets that HashedNets shrink the storage requirements of neural networks substantially while mostly preserving generalization performance.
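The hashing trick behind HashedNets maps each position of a large virtual weight matrix to one of K true parameters, so only K values are stored. A minimal sketch, with a seeded RNG standing in for the hash function (the paper uses a fast deterministic hash such as xxHash):

```python
import numpy as np

def hashed_layer(x, real_weights, out_dim, seed=0):
    """Apply a virtual out_dim x in_dim weight matrix whose entries are
    shared: each (i, j) position maps to one of len(real_weights) true
    parameters. Seeded RNG stands in for the hash function."""
    in_dim = x.shape[0]
    rng = np.random.default_rng(seed)                     # stand-in hash
    idx = rng.integers(0, real_weights.size, size=(out_dim, in_dim))
    virtual_w = real_weights[idx]                         # shares K parameters
    return virtual_w @ x

real_weights = np.array([0.5, -1.0, 0.25])                # only 3 true parameters
x = np.ones(6)
y = hashed_layer(x, real_weights, out_dim=4)
print(y.shape)  # → (4,)
```

Storage drops from out_dim x in_dim weights to K, and because the index pattern is recomputable from the seed (or hash), only the K real weights need to be saved.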