# Deep Convolutional Neural Network Inference with Floating-point Weights and Fixed-point Activations

@article{Lai2017DeepCN, title={Deep Convolutional Neural Network Inference with Floating-point Weights and Fixed-point Activations}, author={Liangzhen Lai and Naveen Suda and Vikas Chandra}, journal={ArXiv}, year={2017}, volume={abs/1703.03073} }

Deep convolutional neural network (CNN) inference requires a significant amount of memory and computation, which limits its deployment on embedded devices. [...] Key Method We show that using floating-point representation for weights is more efficient than fixed-point representation for the same bit-width and demonstrate it on popular large-scale CNNs such as AlexNet, SqueezeNet, GoogLeNet and VGG-16. We also show that such a representation scheme enables compact hardware multiply-and-accumulate (MAC) unit design…
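The key method (floating-point weights multiplied with fixed-point activations) can be sketched in a few lines. This is an illustrative toy, not the paper's implementation: the Q3.4 activation format and the helper names are assumptions chosen for the example.

```python
def quantize_activation(x, int_bits=3, frac_bits=4):
    """Quantize a real-valued activation to signed fixed-point Qm.n.

    int_bits/frac_bits are illustrative choices, not the paper's settings.
    """
    scale = 1 << frac_bits                     # 2**frac_bits steps per unit
    lo = -(1 << (int_bits + frac_bits))        # most negative integer code
    hi = (1 << (int_bits + frac_bits)) - 1     # most positive integer code
    code = max(lo, min(hi, round(x * scale)))  # round to nearest, saturate
    return code / scale                        # dequantized value

def mac(weights, activations, int_bits=3, frac_bits=4):
    """Multiply-accumulate: floating-point weights x fixed-point activations."""
    return sum(w * quantize_activation(a, int_bits, frac_bits)
               for w, a in zip(weights, activations))
```

For example, `quantize_activation(0.1)` snaps 0.1 to the nearest Q3.4 grid point (2/16 = 0.125), while out-of-range inputs saturate at the format limits; only the activations are quantized, so the weights keep their full floating-point precision in the MAC.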

#### 68 Citations

Phoenix: A Low-Precision Floating-Point Quantization Oriented Architecture for Convolutional Neural Networks

- Computer Science, Engineering
- 2020

A normalization-oriented 8-bit floating-point quantization processor, named Phoenix, is proposed to reduce storage and memory access with negligible accuracy loss, and a hardware processor is designed to address the hardware inefficiency caused by floating-point multipliers.

Quantization of deep neural networks for accumulator-constrained processors

- Computer Science, Engineering
- Microprocess. Microsystems
- 2020

We introduce an Artificial Neural Network (ANN) quantization methodology for platforms without wide accumulation registers. This enables fixed-point model deployment on embedded compute…

Deep Neural Network Approximation for Custom Hardware

- Computer Science
- ACM Comput. Surv.
- 2019

This article provides a comprehensive evaluation of approximation methods for high-performance network inference along with in-depth discussion of their effectiveness for custom hardware implementation and includes proposals for future research based on a thorough analysis of current trends.

Quantization of constrained processor data paths applied to convolutional neural networks

- 2018

Artificial Neural Networks (ANNs) can effectively be used to solve many classification and regression problems, and deliver state-of-the-art performance in the application domains of natural language…

Quantization of Constrained Processor Data Paths Applied to Convolutional Neural Networks

- Computer Science
- 2018 21st Euromicro Conference on Digital System Design (DSD)
- 2018

A layer-wise quantization heuristic to find a good fixed-point network approximation for platforms without wide accumulation registers is proposed and it is demonstrated that 16-bit accumulators are able to obtain a Top-1 classification accuracy within 1% of the floating-point baselines on the CIFAR-10 and ILSVRC2012 image classification benchmarks.

Low Precision Floating Point Arithmetic for High Performance FPGA-based CNN Acceleration

- Computer Science, Engineering
- FPGA
- 2020

To the best of the authors' knowledge, this is the first in-depth study to simplify one multiplication for CNN inference to one 4-bit MAC and to implement four multiplications within one DSP while maintaining comparable accuracy without any re-training.

An Energy-Efficient Sparse Deep-Neural-Network Learning Accelerator With Fine-Grained Mixed Precision of FP8–FP16

- Computer Science
- IEEE Solid-State Circuits Letters
- 2019

This letter presents an energy-efficient DNN learning accelerator core supporting CNN and FC learning as well as inference with the following three key features: 1) fine-grained mixed precision (FGMP); 2) compressed sparse DNN learning/inference; and 3) an input load balancer.

Zero-Centered Fixed-Point Quantization With Iterative Retraining for Deep Convolutional Neural Network-Based Object Detectors

- Computer Science
- IEEE Access
- 2021

In the proposed method, the center of the weight distribution is adjusted to zero by subtracting the mean of weight parameters before quantization, and the retraining process is iteratively applied to minimize the accuracy drop caused by quantization.
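The zero-centering step described above can be sketched as follows. This is a simplified per-tensor uniform quantizer for illustration only; it omits the iterative retraining that the paper pairs with it, and the function name and bit-width default are assumptions.

```python
def zero_centered_quantize(weights, bits=8):
    """Center the weight distribution at zero by subtracting its mean,
    then uniformly quantize the centered values to signed integer codes.

    Illustrative sketch of the zero-centering idea, not the paper's
    exact scheme (which also iteratively retrains after quantization).
    """
    mean = sum(weights) / len(weights)
    centered = [w - mean for w in weights]          # distribution now centered at 0
    max_abs = max(abs(w) for w in centered) or 1.0  # avoid divide-by-zero
    levels = (1 << (bits - 1)) - 1                  # e.g. 127 for 8 bits
    scale = levels / max_abs
    codes = [round(w * scale) for w in centered]    # signed integer codes
    # Dequantize: the stored mean is added back at inference time.
    dequant = [c / scale + mean for c in codes]
    return codes, dequant, mean
```

Subtracting the mean first lets the signed code range cover the actual spread of the weights symmetrically, instead of wasting range on a distribution whose center sits away from zero.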

Digital Neuron: A Hardware Inference Accelerator for Convolutional Deep Neural Networks

- Computer Science, Engineering
- ArXiv
- 2018

This paper provides a scheme that reuses input, weight, and output of all layers to reduce DRAM access and verified that the multiplication of integer numbers with 3-partial sub-integers does not cause significant loss of inference accuracy compared to 32-bit floating point calculation.

A Variable Precision Approach for Deep Neural Networks

- Computer Science
- 2019 International Conference on Advanced Technologies for Communications (ATC)
- 2019

The thesis investigates a hardware implementation of multiply-and-add with variable bit precision, which can be adjusted at computation time, and shows that the proposed system can achieve an accuracy of up to 88%.

#### References

Showing 1-10 of 30 references

Fixed Point Quantization of Deep Convolutional Networks

- Computer Science, Mathematics
- ICML
- 2016

This paper proposes a quantizer design for fixed-point implementation of DCNs, formulates and solves an optimization problem to identify the optimal fixed-point bit-width allocation across DCN layers, and demonstrates that fine-tuning can further enhance the accuracy of fixed-point DCNs beyond that of the original floating-point model.

Reduced-Precision Strategies for Bounded Memory in Deep Neural Nets

- Computer Science
- ArXiv
- 2015

This work investigates how using reduced precision data in Convolutional Neural Networks affects network accuracy during classification and proposes a method for finding a low precision configuration for a network while maintaining high accuracy.

Hardware-oriented Approximation of Convolutional Neural Networks

- Computer Science
- ArXiv
- 2016

Ristretto is a model approximation framework that analyzes a given CNN with respect to the numerical resolution used in representing weights and outputs of convolutional and fully connected layers and can condense models by using fixed point arithmetic and representation instead of floating point.

Training deep neural networks with low precision multiplications

- Computer Science
- 2014

It is found that very low precision is sufficient not just for running trained networks but also for training them, and it is possible to train Maxout networks with 10-bit multiplications.

Ristretto: Hardware-Oriented Approximation of Convolutional Neural Networks

- Computer Science
- ArXiv
- 2016

Ristretto is a fast and automated framework for CNN approximation which simulates the hardware arithmetic of a custom hardware accelerator, and can successfully condense CaffeNet and SqueezeNet to 8-bit.

Deep Learning with Limited Numerical Precision

- Computer Science, Mathematics
- ICML
- 2015

The results show that deep networks can be trained using only 16-bit wide fixed-point number representation when using stochastic rounding, and incur little to no degradation in the classification accuracy.
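Stochastic rounding, the key ingredient in that result, rounds a value up with probability equal to its fractional remainder on the fixed-point grid, so the rounding error is zero in expectation. A minimal sketch (the function name and default fraction width are illustrative assumptions):

```python
import math
import random

def stochastic_round(x, frac_bits=8, rng=random.random):
    """Round x to the fixed-point grid of step 2**-frac_bits.

    Rounds up with probability equal to the fractional remainder,
    so E[stochastic_round(x)] == x (unbiased), unlike round-to-nearest.
    """
    scale = 1 << frac_bits
    scaled = x * scale
    floor_val = math.floor(scaled)
    frac = scaled - floor_val                      # in [0, 1): P(round up)
    rounded = floor_val + (1 if rng() < frac else 0)
    return rounded / scale
```

With `frac_bits=8`, a value like 0.3 lands between grid points 76/256 and 77/256 and is rounded up about 80% of the time; averaged over many samples the result converges back to 0.3, which is what keeps the small gradient updates from being systematically lost during low-precision training.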

Accelerating Deep Convolutional Networks using low-precision and sparsity

- Computer Science
- 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
- 2017

This work achieves the highest reported accuracy with extremely low-precision (2-bit) weight networks and builds a deep learning accelerator core, DLAC, that can achieve up to 1 TFLOP/mm2 equivalent for single-precision floating-point operations.

Going Deeper with Embedded FPGA Platform for Convolutional Neural Network

- Computer Science
- FPGA
- 2016

This paper presents an in-depth analysis of state-of-the-art CNN models, shows that convolutional layers are computation-centric and fully-connected layers are memory-centric, and proposes a CNN accelerator design on embedded FPGA for ImageNet large-scale image classification.

Deep Compression: Compressing Deep Neural Network with Pruning, Trained Quantization and Huffman Coding

- Computer Science
- ICLR
- 2016

This work introduces "deep compression", a three-stage pipeline (pruning, trained quantization, and Huffman coding) that works together to reduce the storage requirement of neural networks by 35x to 49x without affecting their accuracy.

Binarized Neural Networks: Training Deep Neural Networks with Weights and Activations Constrained to +1 or -1

- Computer Science
- 2016

A binary matrix multiplication GPU kernel is written with which it is possible to run the MNIST BNN 7 times faster than with an unoptimized GPU kernel, without suffering any loss in classification accuracy.