Corpus ID: 238531208

LCS: Learning Compressible Subspaces for Adaptive Network Compression at Inference Time

Elvis Nunez, Maxwell Horton, Anish K. Prabhu, Anurag Ranjan, Ali Farhadi, Mohammad Rastegari
When deploying deep learning models to a device, it is traditionally assumed that available computational resources (compute, memory, and power) remain static. However, real-world computing systems do not always provide stable resource guarantees. Computational resources need to be conserved when load from other processes is high or battery power is low. Inspired by recent works on neural network subspaces, we propose a method for training a compressible subspace of neural networks that… 
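The abstract is truncated, but the subspace idea it builds on can be sketched in a hedged way: a network is sampled as a convex combination of two learned endpoint weight sets, and the mixing coefficient can be chosen at inference time. The function name and toy weights below are illustrative, not the paper's code.

```python
# Illustrative sketch (not the paper's implementation): a two-endpoint
# linear subspace of weights. Training learns both endpoints w1 and w2;
# at inference time, alpha in [0, 1] picks a point on the line between
# them, trading accuracy against compressibility.
def sample_subspace_weights(w1, w2, alpha):
    """Return the convex combination (1 - alpha) * w1 + alpha * w2."""
    return [(1.0 - alpha) * a + alpha * b for a, b in zip(w1, w2)]

# Toy example: the midpoint of the subspace.
midpoint = sample_subspace_weights([0.5, -1.0, 2.0], [0.1, -0.2, 0.4], 0.5)
```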

All-in-One: A Highly Representative DNN Pruning Framework for Edge Devices with Dynamic Power Management

All-in-One is a highly representative pruning framework designed to work with dynamic power management using DVFS (dynamic voltage and frequency scaling); it achieves high accuracy for multiple models at different pruning ratios and reduces the variance of their inference latency across frequencies, all with minimal memory consumption.

Low-Loss Subspace Compression for Clean Gains against Multi-Agent Backdoor Attacks

This work contributes three defenses that improve multi-agent backdoor robustness by maximizing accuracy on clean labels while minimizing accuracy on poisoned labels.



To prune, or not to prune: exploring the efficacy of pruning for model compression

Across a broad range of neural network architectures, large-sparse models are found to consistently outperform small-dense models, achieving up to a 10x reduction in the number of non-zero parameters with minimal loss in accuracy.
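The sparse models above are produced by magnitude pruning, which can be illustrated with a minimal sketch (names and toy numbers are mine, not from the paper): keep the largest-magnitude weights and zero the rest.

```python
def magnitude_prune(weights, sparsity):
    """Zero out the `sparsity` fraction of weights with smallest magnitude."""
    k = int(len(weights) * sparsity)  # how many weights to remove
    if k == 0:
        return list(weights)
    # Threshold at the k-th smallest magnitude; ties at the threshold
    # are also pruned in this simplified version.
    threshold = sorted(abs(w) for w in weights)[k - 1]
    return [0.0 if abs(w) <= threshold else w for w in weights]

# Toy example: prune half of four weights.
pruned = magnitude_prune([0.5, -0.1, 2.0, 0.05], 0.5)
```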

Switchable Precision Neural Networks

This paper proposes a flexible quantization strategy, termed Switchable Precision Neural Networks (SP-Nets), to train a shared network capable of operating at multiple quantization levels, which can adjust its precision on the fly according to instant memory, latency, power-consumption, and accuracy demands.
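The precision knob SP-Nets expose can be sketched with simple uniform symmetric fake-quantization; this is a hedged illustration of switching bit-widths at runtime, not the paper's actual scheme.

```python
def quantize_symmetric(weights, bits):
    """Fake-quantize a weight list on a uniform symmetric integer grid.

    Higher `bits` means a finer grid; varying `bits` at runtime is the
    kind of precision switch SP-Nets expose (sketch only).
    """
    qmax = 2 ** (bits - 1) - 1                   # e.g. 127 for 8 bits
    scale = max(abs(w) for w in weights) / qmax  # map largest weight to qmax
    return [round(w / scale) * scale for w in weights]

# Toy example: 8-bit quantization of three weights.
q8 = quantize_symmetric([1.0, -0.5, 0.25], 8)
```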

EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks

A new scaling method is proposed that uniformly scales all dimensions of depth, width, and resolution using a simple yet highly effective compound coefficient; its effectiveness is demonstrated by scaling up MobileNets and ResNet.
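Compound scaling can be shown concretely. The base multipliers below are the grid-searched values reported in the EfficientNet paper (alpha = 1.2, beta = 1.1, gamma = 1.15, with alpha * beta**2 * gamma**2 ≈ 2 so that each unit of phi roughly doubles FLOPs); the function itself is a sketch.

```python
ALPHA, BETA, GAMMA = 1.2, 1.1, 1.15  # depth / width / resolution bases

def compound_scale(phi):
    """Return (depth, width, resolution) multipliers for coefficient phi.

    All three dimensions grow together from one coefficient, instead of
    scaling depth, width, or resolution independently.
    """
    return ALPHA ** phi, BETA ** phi, GAMMA ** phi

# Toy example: scale up by two steps of the compound coefficient.
d, w, r = compound_scale(2)
```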

Once for All: Train One Network and Specialize it for Efficient Deployment

This work proposes training a once-for-all (OFA) network that supports diverse architectural settings, decoupling training and search to reduce cost, and introduces a novel progressive shrinking algorithm: a generalized pruning method that reduces the model size across many more dimensions than conventional pruning.

Quantization and Training of Neural Networks for Efficient Integer-Arithmetic-Only Inference

A quantization scheme is proposed that allows inference to be carried out using integer-only arithmetic, which can be implemented more efficiently than floating-point inference on commonly available integer-only hardware.
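The scheme rests on an affine mapping between real values and integers, real ≈ scale * (q - zero_point); a minimal sketch of that mapping (function names are mine) looks like:

```python
def affine_quantize(x, scale, zero_point, qmin=0, qmax=255):
    """Map a real value onto an unsigned 8-bit integer grid."""
    q = round(x / scale) + zero_point
    return max(qmin, min(qmax, q))   # clamp to the representable range

def affine_dequantize(q, scale, zero_point):
    """Recover the approximate real value: x ≈ scale * (q - zero_point)."""
    return scale * (q - zero_point)

# Toy example: quantize 1.0 with scale 0.1 and zero point 128.
q = affine_quantize(1.0, 0.1, 128)
```

Because the zero point is itself an integer, zero is represented exactly, which matters for common operations like zero-padding.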

Layer-Wise Data-Free CNN Compression

A computationally efficient method is presented for compressing a trained neural network without using real data; combining it with high-compute generative methods improves upon their results.

Slimmable Neural Networks

This work presents a simple and general method to train a single neural network executable at different widths, permitting instant and adaptive accuracy-efficiency trade-offs at runtime, and demonstrates better performance of slimmable models compared with individual ones across a wide range of applications.
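The width-switching mechanism can be sketched as follows: a slimmable layer stores the full weight tensor but activates only a leading slice of channels, so no extra weights are needed to change widths at runtime. The channel counts and function name below are illustrative, not the paper's code.

```python
import math

FULL_CHANNELS = [64, 128, 256]   # illustrative per-layer channel counts

def active_channels(width_mult):
    """Channels kept at a given width multiplier (e.g. 0.25, 0.5, 1.0).

    Only the first ceil(width_mult * c) channels of each layer run, so
    switching widths at runtime reuses the same stored weights (sketch,
    not the paper's implementation).
    """
    return [math.ceil(width_mult * c) for c in FULL_CHANNELS]

# Toy example: run the network at half width.
half = active_channels(0.5)
```

In the paper, each width additionally gets its own batch-norm statistics (switchable batch normalization), since feature statistics differ across widths.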

Data-Free Quantization Through Weight Equalization and Bias Correction

We introduce a data-free quantization method for deep neural networks that does not require fine-tuning or hyperparameter selection. It achieves near-original model performance on common computer vision architectures and tasks.

Exploring Sparsity in Recurrent Neural Networks

This work proposes a technique to reduce the number of parameters in a network by pruning weights during initial training, which shrinks the model and can also yield significant inference-time speedups via sparse matrix multiplication.

Compression of Neural Machine Translation Models via Pruning

It is shown that an NMT model with over 200 million parameters can be pruned by 40% with very little performance loss as measured on the WMT'14 English-German translation task.