• Corpus ID: 209483537

Precision Gating: Improving Neural Network Efficiency with Dynamic Dual-Precision Activations

Yichi Zhang, Ritchie Zhao, Weizhe Hua, Nayun Xu, G. Edward Suh, Zhiru Zhang
We propose precision gating (PG), an end-to-end trainable dynamic dual-precision quantization technique for deep neural networks. PG computes most features in a low precision and only a small proportion of important features in a higher precision to preserve accuracy. The proposed approach is applicable to a variety of DNN architectures and significantly reduces the computational cost of DNN execution with almost no accuracy loss. Our experiments indicate that PG achieves excellent results on… 
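
The abstract does not say how important features are selected, so the following is only a minimal sketch of the dual-precision idea under stated assumptions: activations are unsigned 8-bit integers, the low-precision pass uses only their 4 most significant bits, and an output is treated as important when its low-precision value exceeds a threshold. The names `split_bits` and `gated_dot`, the bit widths, and the threshold rule are illustrative and are not taken from the paper.

```python
def split_bits(x, total_bits=8, high_bits=4):
    """Split an unsigned integer activation into its MSB and LSB contributions.

    The two parts always sum back to x. The bit widths are assumptions
    chosen for illustration.
    """
    low_bits = total_bits - high_bits
    msb_part = (x >> low_bits) << low_bits   # keep only the top bits
    lsb_part = x & ((1 << low_bits) - 1)     # remaining low-order bits
    return msb_part, lsb_part


def gated_dot(weights, activations, threshold, total_bits=8, high_bits=4):
    """Dot product that spends high precision only on 'important' outputs.

    Returns (value, used_high_precision). The threshold test is an assumed
    gating rule, not necessarily the rule used by the authors.
    """
    parts = [split_bits(a, total_bits, high_bits) for a in activations]

    # Pass 1: cheap estimate from the most significant bits only.
    low_precision = sum(w * msb for w, (msb, _) in zip(weights, parts))
    if low_precision <= threshold:
        return low_precision, False

    # Pass 2: add the contribution of the remaining bits, which restores
    # the full-precision result for this output.
    correction = sum(w * lsb for w, (_, lsb) in zip(weights, parts))
    return low_precision + correction, True


if __name__ == "__main__":
    w = [1, -2, 3]
    x = [200, 17, 90]
    print(gated_dot(w, x, threshold=100))  # gate opens: full-precision value
    print(gated_dot(w, x, threshold=500))  # gate stays closed: MSB-only value
```

In this sketch the second pass reuses the first pass's partial sum, so an output that passes the gate costs one low-precision pass plus one correction pass, while every other output costs only the low-precision pass.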

Structured Dynamic Precision for Deep Neural Networks Quantization

This work proposes an algorithm-architecture co-design, named Structured Dynamic Precision (SDP), which can achieve 29% performance gain and 51% energy reduction for the same level of model accuracy compared to the state-of-the-art dynamic quantization accelerator DRQ.

Dynamic Dual Gating Neural Networks

Dynamic dual gating, a new dynamic computing method, is proposed to reduce model complexity at run-time; it can achieve higher accuracy under similar computing budgets compared with other dynamic execution methods.

FracBNN: Accurate and FPGA-Efficient Binary Neural Networks with Fractional Activations

The proposed FracBNN exploits fractional activations to substantially improve the accuracy of BNNs, and implements the entire optimized network architecture on an embedded FPGA (Xilinx Ultra96 v2) with the ability of real-time image classification.

Understanding the Impact of Dynamic Channel Pruning on Conditionally Parameterized Convolutions

This paper analyzes a recent method, Feature Boosting and Suppression (FBS), which dynamically assesses which channels contain the most important input-dependent features and prunes the others based on a runtime threshold gating mechanism. It discovers that substituting standard convolutional filters with input-specific filters, as described in CondConv, enables FBS to address this accuracy loss.

Adaptive Precision Training for Resource Constrained Devices

This work proposes Adaptive Precision Training (APT), which reduces both total training energy cost and memory usage at the same time, and allocates layer-wise precision dynamically so that the model learns more quickly for a longer time.

The Evolution of Domain-Specific Computing for Deep Learning

This work looks at trends in deep learning research that present new opportunities for domain-specific hardware architectures and explores how next-generation compilation tools might support them.

Pushing the Envelope of Dynamic Spatial Gating technologies

This paper focuses on one such technology that targets unimportant features in the spatial domain of OFM, called Precision Gating (PG), and shows that PG leads to loss in accuracy when the authors push the MAC reduction achieved by a PG network.

Scalable Color Quantization for Task-Centric Image Compression

This work proposes a scalable color quantization method, where images with variable color space sizes can be extracted from a master image generated by a single DNN model, enabled by weighted color grouping which constructs a color palette using critical color components for the classification task.

Bayesian Bits: Unifying Quantization and Pruning

Bayesian Bits is introduced, a practical method for joint mixed precision quantization and pruning through gradient based optimization that can learn pruned, mixed precision networks that provide a better trade-off between accuracy and efficiency than their static bit width equivalents.

OverQ: Opportunistic Outlier Quantization for Neural Network Accelerators

This work proposes overwrite quantization (OverQ), a lightweight hardware technique that opportunistically increases bitwidth for activation outliers by overwriting nearby zeros, and imagines this technique can complement modern DNN accelerator designs to provide small increases in accuracy with insignificant area overhead.



Channel Gating Neural Networks

An accelerator is designed for channel gating, a dynamic, fine-grained, and hardware-efficient pruning scheme to reduce the computation cost for convolutional neural networks (CNNs), which optimizes CNN inference at run-time by exploiting input-specific characteristics.

PACT: Parameterized Clipping Activation for Quantized Neural Networks

It is shown, for the first time, that both weights and activations can be quantized to 4 bits of precision while still achieving accuracy comparable to full-precision networks across a range of popular models and datasets.

Balanced Quantization: An Effective and Efficient Approach to Quantized Neural Networks

This paper proposes a novel quantization method that can ensure the balance of distributions of quantized values of QNNs without introducing extra computation during inference, has negligible impact on training speed, and is applicable to both convolutional neural networks and recurrent neural networks.

Quantized Neural Networks: Training Neural Networks with Low Precision Weights and Activations

A binary matrix multiplication GPU kernel is programmed with which it is possible to run the MNIST QNN 7 times faster than with an unoptimized GPU kernel, without suffering any loss in classification accuracy.

HAQ: Hardware-Aware Automated Quantization With Mixed Precision

The Hardware-Aware Automated Quantization (HAQ) framework is introduced, which leverages reinforcement learning to automatically determine the quantization policy and takes the hardware accelerator's feedback in the design loop to generate direct feedback signals to the RL agent.

Fixed Point Quantization of Deep Convolutional Networks

This paper proposes a quantizer design for fixed point implementation of DCNs, formulates and solves an optimization problem to identify optimal fixed point bit-width allocation across DCN layers, and demonstrates that fine-tuning can further enhance the accuracy of fixed point DCNs beyond that of the original floating point model.

Value-aware Quantization for Training and Inference of Neural Networks

We propose a novel value-aware quantization which applies aggressively reduced precision to the majority of data while separately handling a small amount of large data in high precision, which

Mixed Precision Quantization of ConvNets via Differentiable Neural Architecture Search

A novel differentiable neural architecture search (DNAS) framework is proposed to efficiently explore its exponential search space with gradient-based optimization and surpass the state-of-the-art compression of ResNet on CIFAR-10 and ImageNet.

Boosting the Performance of CNN Accelerators with Dynamic Fine-Grained Channel Gating

Experimental results show that the proposed approach to dynamic pruning for CNN inference can significantly speed up state-of-the-art networks with a marginal accuracy loss, and enable a trade-off between performance and accuracy.

PredictiveNet: An energy-efficient convolutional neural network via zero prediction

PredictiveNet is proposed, which predicts the sparse outputs of the non-linear layers, thereby bypassing a majority of computations in CNNs at runtime, and can reduce the computational cost by a factor of 2.9× compared to a state-of-the-art CNN, while incurring marginal accuracy degradation.