PruneTrain: fast neural network training by dynamic sparse model reconfiguration

@article{Lym2019PruneTrainFN,
  title={PruneTrain: fast neural network training by dynamic sparse model reconfiguration},
  author={Sangkug Lym and Esha Choukse and Siavash Zangeneh and Wei Wen and Sujay Sanghavi and Mattan Erez},
  journal={Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis},
  year={2019}
}
State-of-the-art convolutional neural networks (CNNs) used in vision applications have large models with numerous weights. Training these models is very compute- and memory-resource intensive. Much research has been done on pruning or compressing these models to reduce the cost of inference, but little work has addressed the costs of training. We focus precisely on accelerating training. We propose PruneTrain, a cost-efficient mechanism that gradually reduces the training cost during training… 
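A minimal sketch of the pruning-during-training idea, not the authors' implementation: a group-lasso penalty added to the training loss shrinks whole convolution channels toward zero, and a periodic reconfiguration step keeps only the channels whose group norm is still above a threshold, rebuilding the layer at a smaller size. The penalty weight and threshold below are illustrative assumptions.

```python
import torch
import torch.nn as nn

def group_lasso_penalty(conv: nn.Conv2d, lam: float = 1e-4) -> torch.Tensor:
    # One group per output channel: L2 norm of each filter, summed and scaled.
    # lam is an illustrative value, not one taken from the paper.
    return lam * conv.weight.flatten(1).norm(dim=1).sum()

def surviving_channels(conv: nn.Conv2d, threshold: float = 1e-3) -> torch.Tensor:
    # Indices of channels whose group norm is still above the (assumed) threshold;
    # at each reconfiguration interval a smaller dense layer would be rebuilt from them.
    return (conv.weight.flatten(1).norm(dim=1) > threshold).nonzero().flatten()
```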

Winning the Lottery Ahead of Time: Efficient Early Network Pruning

TLDR
Early Compression via Gradient Flow Preservation (EarlyCroP) is proposed, which efficiently extracts state-of-the-art sparse models before or early in training and can be applied in a structured manner.

An Efficient End-to-End Deep Learning Training Framework via Fine-Grained Pattern-Based Pruning

TLDR
This paper proposes ClickTrain, an efficient and accurate end-to-end training and pruning framework for CNNs that reduces the end-to-end time cost of state-of-the-art pruning-after-training methods and provides higher model accuracy and compression ratio via fine-grained architecture-preserving pruning.

ClickTrain: efficient and accurate end-to-end deep learning training via fine-grained architecture-preserving pruning

TLDR
Compared with the state-of-the-art pruning-during-training approach, ClickTrain provides significant improvements in both accuracy and compression ratio on the tested CNN models and datasets, under a similarly limited training time.

Effective Model Sparsification by Scheduled Grow-and-Prune Methods

TLDR
A novel scheduled grow-and-prune (GaP) methodology is proposed that does not require pre-training a dense model; it addresses the shortcomings of previous work by repeatedly growing a subset of layers to dense and then pruning them back to sparse after some training.
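A hedged sketch of that grow-and-prune cycle, assuming simple magnitude-based pruning and a user-supplied training callback (`train_steps` is hypothetical): one layer group at a time is grown back to dense, trained, and then pruned back to sparse.

```python
import torch

def magnitude_mask(weight: torch.Tensor, sparsity: float) -> torch.Tensor:
    # Keep the (1 - sparsity) fraction of weights with the largest magnitude.
    k = max(1, int(weight.numel() * (1.0 - sparsity)))
    threshold = weight.abs().flatten().topk(k).values[-1]
    return (weight.abs() >= threshold).float()

def gap_round(weights: dict, train_steps, sparsity: float = 0.8) -> dict:
    # weights maps layer-group names to weight tensors; the sparsity level is illustrative.
    masks = {name: magnitude_mask(w, sparsity) for name, w in weights.items()}
    for name, w in weights.items():                # visit each layer group in turn
        masks[name] = torch.ones_like(w)           # grow this group back to dense
        train_steps(masks)                         # train with the current masks applied
        masks[name] = magnitude_mask(w, sparsity)  # prune it back to sparse
    return masks
```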

An Oracle for Guiding Large-Scale Model/Hybrid Parallel Training of Convolutional Neural Networks

TLDR
This work analyzes the compute, communication, and memory requirements of Convolutional Neural Networks (CNNs) to understand the trade-offs among different parallelism approaches in terms of performance and scalability. It concludes that the oracle has an average accuracy of about 86.74% when compared to empirical results, and as high as 97.57% for data parallelism.

Only Train Once: A One-Shot Neural Network Training And Pruning Framework

TLDR
Only-Train-Once (OTO), a framework that compresses DNNs into slimmer architectures with competitive performance and significant FLOPs reductions, is proposed, along with a novel optimization algorithm, Half-Space Stochastic Projected Gradient (HSPG), which outperforms standard proximal methods on group-sparsity exploration while maintaining comparable convergence.

Weight Update Skipping: Reducing Training Time for Artificial Neural Networks

TLDR
A new training methodology for ANNs is proposed that exploits the observation that accuracy improvements show temporal variations, allowing weight updates to be skipped when the variation is minuscule; it achieves virtually the same accuracy with considerably less computation and reduces the time spent on training.
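A hedged sketch of that skipping policy, not the paper's exact mechanism: accuracy is sampled periodically, and if the change since the last sample is below a small epsilon, weight updates (e.g., calls to the optimizer step) are skipped for a while. The epsilon and skip length are illustrative assumptions.

```python
class UpdateSkipper:
    """Skip weight updates while the observed accuracy barely changes."""

    def __init__(self, epsilon: float = 1e-3, skip_steps: int = 100):
        self.epsilon = epsilon            # minimum accuracy change that counts as progress
        self.skip_steps = skip_steps      # how many update steps to skip once triggered
        self.prev_acc = None
        self.remaining_skips = 0

    def observe_accuracy(self, acc: float) -> None:
        if self.prev_acc is not None and abs(acc - self.prev_acc) < self.epsilon:
            self.remaining_skips = self.skip_steps   # variation is minuscule: pause updates
        self.prev_acc = acc

    def should_update(self) -> bool:
        if self.remaining_skips > 0:
            self.remaining_skips -= 1
            return False                  # caller skips optimizer.step() this iteration
        return True
```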

FlexSA: Flexible Systolic Array Architecture for Efficient Pruned DNN Model Training

TLDR
FlexSA, a flexible systolic array architecture that dynamically reconfigures the systolic array structure and offers multiple sub-systolic operating modes, is proposed; it is designed for energy- and memory-bandwidth-efficient processing of tensors with different sizes and shapes.

SparseTrain: Leveraging Dynamic Sparsity in Software for Training DNNs on General-Purpose SIMD Processors

TLDR
This paper proposes SparseTrain, a software-only scheme that leverages dynamic sparsity during training on general-purpose SIMD processors by exploiting the zeros that the ReLU activation function introduces into both feature maps and their gradients.
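A toy illustration of the dynamic-sparsity idea: because ReLU outputs contain many zeros, an inner product can skip the zero entries instead of multiplying them. This scalar-level sketch is only meant to show the principle, not SparseTrain's vectorized SIMD scheme.

```python
import numpy as np

def sparse_dot(activations: np.ndarray, weights: np.ndarray) -> float:
    nonzero = np.flatnonzero(activations)            # positions where ReLU kept a value
    return float(activations[nonzero] @ weights[nonzero])

x = np.maximum(np.random.randn(16), 0.0)             # ReLU-style activations (roughly half zeros)
w = np.random.randn(16)
assert np.isclose(sparse_dot(x, w), x @ w)           # same result, fewer multiplications
```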

E2-Train: Training State-of-the-art CNNs with Over 80% Energy Savings

TLDR
This paper attempts to conduct more energy-efficient training of CNNs, so as to enable on-device training, by dropping unnecessary computations at three complementary levels: stochastic mini-batch dropping at the data level, selective layer update at the model level, and sign prediction for low-cost, low-precision back-propagation at the algorithm level.
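A minimal sketch of the data-level ingredient, stochastic mini-batch dropping: each mini-batch is skipped with probability p, so on average a fraction p of the per-iteration compute is saved. The drop probability and the `train_step` callback are illustrative assumptions.

```python
import random

def train_epoch(batches, train_step, drop_prob: float = 0.5) -> None:
    for batch in batches:
        if random.random() < drop_prob:
            continue            # drop this mini-batch entirely
        train_step(batch)       # forward + backward + update as usual
```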

References


Structured Pruning of Deep Convolutional Neural Networks

TLDR
The proposed work shows that when pruning granularities are applied in combination, the CIFAR-10 network can be pruned by more than 70% with less than a 1% loss in accuracy.

Network Trimming: A Data-Driven Neuron Pruning Approach towards Efficient Deep Architectures

TLDR
This paper introduces network trimming which iteratively optimizes the network by pruning unimportant neurons based on analysis of their outputs on a large dataset, inspired by an observation that the outputs of a significant portion of neurons in a large network are mostly zero.
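A hedged sketch of that data-driven criterion: measure, over a large dataset, how often each neuron's post-ReLU output is zero, and mark neurons that are almost always zero as candidates for pruning. The 0.9 threshold is an illustrative assumption.

```python
import numpy as np

def zero_fraction(activations: np.ndarray) -> np.ndarray:
    # activations: (num_samples, num_neurons), collected after the ReLU
    return (activations == 0.0).mean(axis=0)

def neurons_to_prune(activations: np.ndarray, threshold: float = 0.9) -> np.ndarray:
    # Neurons whose outputs are zero for more than `threshold` of the samples
    return np.where(zero_fraction(activations) > threshold)[0]
```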

Pruning Filters for Efficient ConvNets

TLDR
This work presents an acceleration method for CNNs, where it is shown that even simple filter pruning techniques can reduce inference costs for VGG-16 and ResNet-110 by up to 38% on CIFAR10 while regaining close to the original accuracy by retraining the networks.
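A hedged sketch of such a simple filter-pruning step, assuming the common L1-norm ranking: filters of a convolution layer are ranked by the L1 norm of their weights and the smallest ones are removed before retraining. The prune ratio below is an illustrative assumption.

```python
import torch
import torch.nn as nn

def smallest_l1_filters(conv: nn.Conv2d, prune_ratio: float = 0.3) -> torch.Tensor:
    l1 = conv.weight.abs().flatten(1).sum(dim=1)   # one L1 norm per output filter
    num_prune = int(l1.numel() * prune_ratio)
    return l1.argsort()[:num_prune]                # indices of the filters to remove
```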

Learning both Weights and Connections for Efficient Neural Network

TLDR
A method is presented that reduces the storage and computation required by neural networks by an order of magnitude without affecting their accuracy, by learning only the important connections and pruning redundant connections with a three-step method.
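A minimal sketch of the pruning step in that three-step method (train, prune, retrain), shown for a single weight tensor: connections whose magnitude falls below a threshold are zeroed, and the returned mask is reapplied during retraining to keep them zero. The threshold is an illustrative assumption.

```python
import torch

def prune_small_connections(weight: torch.Tensor, threshold: float) -> torch.Tensor:
    mask = (weight.abs() > threshold).float()
    weight.data.mul_(mask)        # zero out the pruned connections in place
    return mask                   # reapply after each retraining update to keep them zero
```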

Compression-aware Training of Deep Networks

TLDR
It is shown that accounting for compression during training allows us to learn much more compact, yet at least as effective, models than state-of-the-art compression techniques.

Pruning Convolutional Neural Networks for Resource Efficient Transfer Learning

TLDR
A new criterion is proposed, based on an efficient first-order Taylor expansion that approximates the absolute change in training cost induced by pruning a network component; it demonstrates superior performance compared to other criteria, such as the norm of kernel weights or average feature-map activation.
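A hedged sketch of a first-order Taylor criterion of this kind: the change in loss from removing a channel is approximated by the magnitude of the product of its activation and its gradient, averaged over the feature map and the batch. The (batch, channels, height, width) layout is an assumption.

```python
import torch

def taylor_importance(activation: torch.Tensor, grad: torch.Tensor) -> torch.Tensor:
    # |mean over each feature map of activation * gradient|, averaged over the batch;
    # one score per channel, with smaller scores meaning safer to prune.
    per_example = (activation * grad).mean(dim=(2, 3)).abs()   # (batch, channels)
    return per_example.mean(dim=0)                             # (channels,)
```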

Deep Compression: Compressing Deep Neural Network with Pruning, Trained Quantization and Huffman Coding

TLDR
This work introduces "deep compression", a three-stage pipeline of pruning, trained quantization, and Huffman coding whose stages work together to reduce the storage requirements of neural networks by 35x to 49x without affecting their accuracy.
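A hedged sketch of the first two stages of that pipeline: magnitude pruning, then weight sharing by clustering the surviving weights (a tiny 1-D k-means stands in for the trained-quantization stage; Huffman coding is omitted). The threshold, cluster count, and iteration count are illustrative assumptions.

```python
import numpy as np

def prune_and_share(weights: np.ndarray, threshold: float = 0.05, n_clusters: int = 16):
    kept = weights[np.abs(weights) > threshold]                 # stage 1: prune small weights
    centers = np.linspace(kept.min(), kept.max(), n_clusters)   # stage 2: initial shared values
    for _ in range(10):                                         # a few k-means refinements
        assign = np.argmin(np.abs(kept[:, None] - centers[None, :]), axis=1)
        for c in range(n_clusters):
            if np.any(assign == c):
                centers[c] = kept[assign == c].mean()
    return centers, assign                                      # shared values and per-weight indices
```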

Learning Structured Sparsity in Deep Neural Networks

TLDR
The results show that for CIFAR-10, regularization on layer depth can reduce a 20-layer Deep Residual Network to 18 layers while improving the accuracy from 91.25% to 92.60%, which is still slightly higher than that of the original 32-layer ResNet.
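A hedged sketch of a structured-sparsity penalty in this spirit: a group-lasso term over filter groups and input-channel groups of a convolution layer pushes whole structures toward zero so they can be removed. The weighting factor is an illustrative value, and the depth-wise grouping mentioned above is omitted.

```python
import torch
import torch.nn as nn

def ssl_penalty(conv: nn.Conv2d, lam: float = 1e-4) -> torch.Tensor:
    w = conv.weight                                                   # (out_ch, in_ch, kH, kW)
    filter_groups = w.flatten(1).norm(dim=1).sum()                    # one group per output filter
    channel_groups = w.transpose(0, 1).flatten(1).norm(dim=1).sum()   # one group per input channel
    return lam * (filter_groups + channel_groups)                     # added to the training loss
```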

Scalpel: Customizing DNN pruning to the underlying hardware parallelism

TLDR
This work implemented weight pruning for several popular networks on a variety of hardware platforms and observed surprising results, including mean speedups of 3.54x, 2.61x, and 1.25x while reducing the model sizes by 88%, 82%, and 53%.

Mini-batch Serialization: CNN Training with Inter-layer Data Reuse

TLDR
The MBS CNN training approach is introduced, which significantly reduces memory traffic by partially serializing mini-batch processing across groups of layers; it optimizes reuse within on-chip buffers and balances both intra-layer and inter-layer reuse.
...