Corpus ID: 11164506

Deep Convolutional Neural Network Inference with Floating-point Weights and Fixed-point Activations

Liangzhen Lai, Naveen Suda, Vikas Chandra
Deep convolutional neural network (CNN) inference requires a significant amount of memory and computation, which limits its deployment on embedded devices. We show that using floating-point representation for weights is more efficient than fixed-point representation for the same bit-width, and demonstrate it on popular large-scale CNNs such as AlexNet, SqueezeNet, GoogLeNet and VGG-16. We also show that such a representation scheme enables compact hardware multiply-and-accumulate (MAC) unit design…
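To make the float-weight idea above concrete, here is a minimal sketch of rounding a weight to a small custom floating-point format (sign + exponent + mantissa in a few bits). The helper name and parameters are ours, and subnormal/overflow handling is deliberately omitted; it is an illustration of the representation, not the paper's exact scheme.

```python
import math

def minifloat(x, man_bits=3, exp_bits=4):
    # Hypothetical helper: round x to a tiny float format with
    # man_bits of mantissa and exp_bits of exponent range.
    # Subnormals and overflow-to-infinity are omitted for brevity.
    if x == 0:
        return 0.0
    m, e = math.frexp(x)                      # x = m * 2**e, 0.5 <= |m| < 1
    scale = 2 ** (man_bits + 1)
    m = math.copysign(round(abs(m) * scale) / scale, x)  # round mantissa
    lo, hi = -(2 ** (exp_bits - 1)) + 1, 2 ** (exp_bits - 1)
    e = max(min(e, hi), lo)                   # clamp exponent range
    return math.ldexp(m, e)

w = 0.1                # a full-precision weight
wq = minifloat(w)      # its nearest value in the reduced float format
```

The relative error is bounded by the mantissa width regardless of the weight's magnitude, which is why a float representation covers a wide dynamic range of weights more efficiently than fixed point at the same bit-width.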

Short floating-point representation for convolutional neural network inference

The experimental results show that the short floating-point representation with 8-bit total width achieves less than a 1-percentage-point degradation in top-5 accuracy, without the aid of retraining, on very deep CNNs of up to 152 layers, and gives more than a 60% area reduction in the ASIC implementation.

Phoenix: A Low-Precision Floating-Point Quantization Oriented Architecture for Convolutional Neural Networks

An 8-bit floating-point quantization-oriented processor, named Phoenix, is proposed to reduce storage and memory access with negligible accuracy loss, and a hardware processor is designed to address the hardware inefficiency caused by the floating-point multiplier.

Quantization of deep neural networks for accumulator-constrained processors

Deep Neural Network Approximation for Custom Hardware

This article provides a comprehensive evaluation of approximation methods for high-performance network inference along with in-depth discussion of their effectiveness for custom hardware implementation and includes proposals for future research based on a thorough analysis of current trends.

Quantization of constrained processor data paths applied to convolutional neural networks

A layer-wise quantization heuristic to find a good fixed-point network approximation for platforms without wide accumulation registers is proposed, and it is demonstrated that 16-bit accumulators are able to obtain a Top-1 classification accuracy within 1% of the floating-point baselines on the CIFAR-10 and ILSVRC2012 image classification benchmarks.
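The constraint these two accumulator papers target can be sketched in a few lines: a dot product of int8 operands whose partial sums are clamped to a 16-bit register. The example values are ours, chosen so the narrow accumulator saturates while a wide one does not.

```python
import numpy as np

INT16_MIN, INT16_MAX = -(1 << 15), (1 << 15) - 1

def saturating_dot(a, b):
    # Models a processor without wide accumulation registers: every
    # partial sum is clamped to int16, so the quantization heuristic
    # must keep partial sums small enough to avoid silent saturation.
    acc = 0
    for x, y in zip(a, b):
        acc = min(max(acc + int(x) * int(y), INT16_MIN), INT16_MAX)
    return acc

a = np.array([100, 120, 90], dtype=np.int8)
b = np.array([110, 100, 127], dtype=np.int8)
narrow = saturating_dot(a, b)                         # clamped to 32767
exact = int(a.astype(np.int32) @ b.astype(np.int32))  # 34430 with a wide accumulator
```

The gap between `narrow` and `exact` is exactly the error a layer-wise quantization scheme for such platforms has to design around.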

Low Precision Floating Point Arithmetic for High Performance FPGA-based CNN Acceleration

To the best of the authors' knowledge, this is the first in-depth study to simplify one multiplication for CNN inference to one 4-bit MAC and implement four multiplications within one DSP while maintaining comparable accuracy without any re-training.

An Energy-Efficient Sparse Deep-Neural-Network Learning Accelerator With Fine-Grained Mixed Precision of FP8–FP16

This letter presents an energy-efficient DNN learning accelerator core supporting CNN and FC learning as well as inference with the following three key features: 1) fine-grained mixed precision (FGMP); 2) compressed sparse DNN learning/inference; and 3) an input load balancer.

Zero-Centered Fixed-Point Quantization With Iterative Retraining for Deep Convolutional Neural Network-Based Object Detectors

In the proposed method, the center of the weight distribution is adjusted to zero by subtracting the mean of weight parameters before quantization, and the retraining process is iteratively applied to minimize the accuracy drop caused by quantization.
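The zero-centering step described above is simple enough to sketch directly: subtract the weight mean, quantize on a symmetric grid, and keep the mean as a float offset to add back after dequantization. Function names and the default bit-width are ours; the iterative retraining stage is not modeled.

```python
import numpy as np

def zero_centered_quantize(w, bits=8):
    # Shift the weight distribution to zero mean, then quantize on a
    # symmetric fixed-point grid; the float mean is stored alongside
    # the integers and restored at dequantization time.
    mean = float(w.mean())
    centered = w - mean
    scale = float(np.abs(centered).max()) / (2 ** (bits - 1) - 1)
    q = np.round(centered / scale).astype(np.int32)
    return q, scale, mean

def dequantize(q, scale, mean):
    return q * scale + mean

w = np.random.default_rng(0).normal(loc=0.3, scale=0.05, size=1000)
q, s, m = zero_centered_quantize(w)
max_err = float(np.abs(dequantize(q, s, m) - w).max())
```

Because the grid only has to span the centered range rather than the shifted one, the step size `s` (and hence the worst-case error, half a step) is smaller than it would be without the mean subtraction.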

A Variable Precision Approach for Deep Neural Networks

The thesis investigates a hardware implementation of multiply-and-add with variable bit precision which can be adjusted at computation time, and shows that the proposed system can achieve an accuracy of up to 88%.

Fixed Point Quantization of Deep Convolutional Networks

This paper proposes a quantizer design for fixed-point implementation of DCNs, formulates and solves an optimization problem to identify the optimal fixed-point bit-width allocation across DCN layers, and demonstrates that fine-tuning can further enhance the accuracy of fixed-point DCNs beyond that of the original floating-point model.
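The quantity that such a per-layer bit-width optimization trades against total bits is the signal-to-quantization-noise ratio (SQNR). A generic sketch (not the paper's exact optimizer; names are ours) of a uniform fixed-point quantizer and its SQNR:

```python
import numpy as np

def fixed_point_quantize(x, int_bits, frac_bits):
    # Uniform fixed-point quantizer: int_bits sets the clipping range,
    # frac_bits sets the step size (resolution).
    step = 2.0 ** -frac_bits
    lo, hi = -(2.0 ** int_bits), 2.0 ** int_bits - step
    return np.clip(np.round(x / step) * step, lo, hi)

def sqnr_db(x, xq):
    # Signal-to-quantization-noise ratio in dB; each extra fractional
    # bit buys roughly 6 dB, which is what makes layer-wise bit-width
    # allocation a tractable optimization.
    return 10 * np.log10(np.sum(x ** 2) / np.sum((x - xq) ** 2))

x = np.random.default_rng(1).normal(size=10000)
sqnrs = [sqnr_db(x, fixed_point_quantize(x, 3, fb)) for fb in (2, 4, 6)]
```

Running this shows SQNR rising steadily with the fractional bit-width, making the accuracy/bit trade-off explicit per layer.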

Reduced-Precision Strategies for Bounded Memory in Deep Neural Nets

This work investigates how using reduced precision data in Convolutional Neural Networks affects network accuracy during classification and proposes a method for finding a low precision configuration for a network while maintaining high accuracy.

Hardware-oriented Approximation of Convolutional Neural Networks

Ristretto is a model approximation framework that analyzes a given CNN with respect to numerical resolution used in representing weights and outputs of convolutional and fully connected layers and can condense models by using fixed point arithmetic and representation instead of floating point.

Training deep neural networks with low precision multiplications

It is found that very low precision is sufficient not just for running trained networks but also for training them, and that it is possible to train Maxout networks with 10-bit multiplications.

Ristretto: Hardware-Oriented Approximation of Convolutional Neural Networks

Ristretto is a fast and automated framework for CNN approximation which simulates the hardware arithmetic of a custom hardware accelerator, and can successfully condense CaffeNet and SqueezeNet to 8-bit.

Deep Learning with Limited Numerical Precision

The results show that deep networks can be trained using only 16-bit wide fixed-point number representation when using stochastic rounding, and incur little to no degradation in the classification accuracy.
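Stochastic rounding, the ingredient that makes 16-bit fixed-point training work in the result above, is easy to sketch: round up with probability equal to the fractional remainder, so the rounding is unbiased in expectation. The helper name and grid width are ours.

```python
import numpy as np

def stochastic_round(x, frac_bits, rng):
    # Round to a 2**-frac_bits grid, rounding up with probability equal
    # to the fractional remainder. Unbiased in expectation, which is
    # what lets small gradient updates survive low precision instead of
    # being deterministically rounded away.
    step = 2.0 ** -frac_bits
    scaled = x / step
    floor = np.floor(scaled)
    round_up = rng.random(np.shape(x)) < (scaled - floor)
    return (floor + round_up) * step

rng = np.random.default_rng(0)
x = np.full(100_000, 0.1)
avg = float(stochastic_round(x, 3, rng).mean())  # grid step 0.125, mean stays near 0.1
```

Each individual value lands on 0.0 or 0.125, yet the average stays near 0.1, exactly the property deterministic round-to-nearest lacks.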

Accelerating Deep Convolutional Networks using low-precision and sparsity

This work achieves the highest reported accuracy with extremely low-precision (2-bit) weight networks and builds a deep learning accelerator core, DLAC, that can achieve up to 1 TFLOP/mm² equivalent for single-precision floating-point operations.

Going Deeper with Embedded FPGA Platform for Convolutional Neural Network

This paper presents an in-depth analysis of state-of-the-art CNN models, shows that convolutional layers are computation-centric while fully-connected layers are memory-centric, and proposes a CNN accelerator design on an embedded FPGA for ImageNet large-scale image classification.

Deep Compression: Compressing Deep Neural Network with Pruning, Trained Quantization and Huffman Coding

This work introduces "deep compression", a three-stage pipeline of pruning, trained quantization and Huffman coding, whose stages work together to reduce the storage requirement of neural networks by 35x to 49x without affecting their accuracy.
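The first two stages of that pipeline can be sketched in a few lines of NumPy. This is a simplification with our own names: real deep compression uses k-means with retraining for the codebook, and a third Huffman stage then compresses the resulting index stream, which is omitted here.

```python
import numpy as np

def prune(w, sparsity=0.7):
    # Stage 1: magnitude pruning -- zero out the smallest weights.
    thresh = np.quantile(np.abs(w), sparsity)
    return np.where(np.abs(w) < thresh, 0.0, w)

def share_weights(w, n_clusters=16):
    # Stage 2 (simplified): snap surviving weights to a small codebook
    # built from quantiles; the paper uses k-means plus retraining.
    # Stage 3, Huffman coding, would then compress the index stream.
    nz = w[w != 0]
    codebook = np.quantile(nz, (np.arange(n_clusters) + 0.5) / n_clusters)
    idx = np.abs(nz[:, None] - codebook[None, :]).argmin(axis=1)
    out = w.copy()
    out[w != 0] = codebook[idx]
    return out, codebook

w = np.random.default_rng(2).normal(size=4096)
wc, codebook = share_weights(prune(w))
```

After these two stages the layer needs only a sparse index structure plus 4-bit codebook indices, which is where the bulk of the reported 35x-49x storage reduction comes from.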

Binarized Neural Networks: Training Deep Neural Networks with Weights and Activations Constrained to +1 or -1

A binary matrix multiplication GPU kernel is written with which it is possible to run the MNIST BNN 7 times faster than with an unoptimized GPU kernel, without suffering any loss in classification accuracy.
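The trick behind that binary matmul kernel is that for +/-1 operands a product equals +1 exactly when the operands match, so a length-n dot product reduces to 2*matches - n, computable with XNOR and popcount instead of multiplies. A NumPy sketch of the arithmetic identity (not the GPU kernel itself; names are ours):

```python
import numpy as np

def binarized_matmul(A, B):
    # For +/-1 values, a*b == +1 iff a == b, so each dot product of
    # length n equals 2*matches - n. Real BNN kernels compute the same
    # thing with bitwise XNOR + popcount on packed bit words.
    n = A.shape[1]
    matches = (A[:, :, None] == B[None, :, :]).sum(axis=1)
    return 2 * matches - n

rng = np.random.default_rng(0)
A = rng.choice([-1, 1], size=(4, 32))
B = rng.choice([-1, 1], size=(32, 5))
C = binarized_matmul(A, B)  # identical to A @ B, multiplication-free
```

On hardware, packing 32 or 64 binary weights into one machine word turns each chunk of the dot product into a single XNOR plus popcount, which is the source of the reported 7x speedup.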