Accuracy to Throughput Trade-Offs for Reduced Precision Neural Networks on Reconfigurable Logic

by Jiang Su, Nicholas J. Fraser, Giulio Gambardella, Michaela Blott, Gianluca Durelli, David B. Thomas, Philip Heng Wai Leong and Peter Y. K. Cheung
Modern Convolutional Neural Networks (CNNs) are typically implemented with floating-point linear algebra. Recently, reduced-precision Neural Networks (NNs) have gained popularity because they require significantly less memory and fewer computational resources than their floating-point counterparts, which is particularly important in power-constrained compute environments. However, in many cases a reduction in precision comes at a small cost to the accuracy of the resulting network. In this work, we…

Structured Dynamic Precision for Deep Neural Networks Quantization

This work proposes an algorithm-architecture co-design, named Structured Dynamic Precision (SDP), which can achieve 29% performance gain and 51% energy reduction for the same level of model accuracy compared to the state-of-the-art dynamic quantization accelerator DRQ.

Model Compression and Hardware Acceleration for Neural Networks: A Comprehensive Survey

This article reviews the mainstream compression approaches such as compact models, tensor decomposition, data quantization, and network sparsification, answers the question of how to leverage these methods in the design of neural network accelerators, and presents the state-of-the-art hardware architectures.

Evaluation of Optimized CNNs on Heterogeneous Accelerators Using a Novel Benchmarking Approach

It is shown that channel pruning is most effective and works across most hardware platforms, with speedups directly correlated to the reduction in compute load, while FPGAs benefit the most from quantization.

Optimising Hardware Accelerated Neural Networks with Quantisation and a Knowledge Distillation Evolutionary Algorithm

This paper compares the latency, accuracy, training time and hardware costs of neural networks compressed with the authors' new multi-objective evolutionary algorithm, NEMOKD, and with quantisation, and identifies a sweet spot at 3-bit precision in the trade-off between latency, hardware requirements, training time and accuracy.

FINN-R: An End-to-End Deep-Learning Framework for Fast Exploration of Quantized Neural Networks

The second generation of the FINN framework is described, an end-to-end tool which enables design space exploration and automates the creation of fully customized inference engines on FPGAs that optimizes for given platforms, design targets and a specific precision.

On the Universal Approximability and Complexity Bounds of Quantized ReLU Neural Networks

This paper proves the universal approximability of quantized ReLU networks on a wide class of functions and provides upper bounds on the number of weights and the memory size for a given approximation error bound and the bit-width of weights for function-independent and function-dependent structures.

Applied Reconfigurable Computing. Architectures, Tools, and Applications: 16th International Symposium, ARC 2020, Toledo, Spain, April 1–3, 2020, Proceedings

A novel method for fast and accurate estimation of latency based on a Gaussian process parametrised by an analytic approximation and coupled with runtime data is introduced.

Photonic Integrated Reconfigurable Linear Processors as Neural Network Accelerators

The silicon-on-insulator processor outperforms the silicon nitride one in terms of footprint and energy efficiency and the lower extinction ratio of Mach–Zehnder elements in the latter platform limits their expressivity.

Accuracy, Training Time and Hardware Efficiency Trade-Offs for Quantized Neural Networks on FPGAs

The hardware resources that neural networks need for execution often exceed those available on FPGAs, motivating quantization to reduce these requirements.

Scaling Binarized Neural Networks on Reconfigurable Logic

It is shown how padding can be employed on BNNs while still maintaining a 1-bit datapath and high accuracy; a large BNN requiring 1.2 billion operations per frame is projected to classify CIFAR-10 images at 88.7% accuracy and 12 kFPS with 671 μs latency, while drawing less than 41 W board power.

Low precision arithmetic for deep learning

It is found that very low precision computation is sufficient not just for running trained networks but also for training them.

Resiliency of Deep Neural Networks under Quantization

This research shows that highly complex DNNs have the capability of absorbing the effects of severe weight quantization through retraining, but connection limited networks are less resilient.

Quantized Convolutional Neural Networks for Mobile Devices

This paper proposes an efficient framework, namely Quantized CNN, to simultaneously speed-up the computation and reduce the storage and memory overhead of CNN models.

Binarized Neural Networks: Training Deep Neural Networks with Weights and Activations Constrained to +1 or -1

A binary matrix multiplication GPU kernel is written with which it is possible to run the MNIST BNN 7 times faster than with an unoptimized GPU kernel, without suffering any loss in classification accuracy.
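As a hedged illustration of the +1/−1 constraint (the array names and shapes here are invented for the example, not taken from the paper): once both operands are binarized, a dot product reduces to integer accumulation, which hardware kernels realize as XNOR plus popcount.

```python
import numpy as np

def binarize(x):
    # Deterministic binarization: +1 for non-negative values, -1 otherwise
    # (mapping zero to +1 is a common convention).
    return np.where(x >= 0, 1, -1).astype(np.int8)

rng = np.random.default_rng(0)
W = rng.standard_normal((4, 8))   # toy full-precision weights
x = rng.standard_normal(8)        # toy input activations

Wb, xb = binarize(W), binarize(x)
y = Wb.astype(np.int32) @ xb.astype(np.int32)  # integer-only accumulation

# Bit-level trick used by BNN kernels: encode -1 as bit 0 and +1 as bit 1;
# then dot = 2 * popcount(XNOR(a, b)) - n over n bit positions.
a_bits = (xb > 0).astype(int)
w_bits = (Wb[0] > 0).astype(int)
matches = (1 - (a_bits ^ w_bits)).sum()   # popcount of the XNOR
assert int(Wb[0].astype(np.int32) @ xb.astype(np.int32)) == 2 * matches - len(xb)
```

Each output is a sum of eight ±1 terms, so it is an even integer in [-8, 8] and never touches a floating-point multiplier.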

DoReFa-Net: Training Low Bitwidth Convolutional Neural Networks with Low Bitwidth Gradients

DoReFa-Net, a method to train convolutional neural networks with low-bitwidth weights and activations using low-bitwidth parameter gradients, is proposed and can achieve prediction accuracy comparable to its 32-bit counterparts.
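A minimal sketch of the k-bit quantizer used in this style of training, mapping a value in [0, 1] onto 2^k uniform levels; the tanh-based weight rescaling follows the scheme described in the paper, but the helper names are mine.

```python
import numpy as np

def quantize_k(r, k):
    # Quantize r in [0, 1] to one of 2^k uniformly spaced levels.
    n = 2**k - 1
    return np.round(r * n) / n

def dorefa_weight(w, k):
    # k-bit weight quantization: squash weights into [0, 1] with tanh,
    # quantize, then rescale back to [-1, 1].
    t = np.tanh(w)
    r = t / (2 * np.max(np.abs(t))) + 0.5
    return 2 * quantize_k(r, k) - 1

w = np.array([-1.5, -0.2, 0.0, 0.3, 2.0])
wq = dorefa_weight(w, k=2)   # each value lands on one of {-1, -1/3, 1/3, 1}
```

During training, the rounding step is treated as identity in the backward pass (a straight-through estimator), so gradients can flow through the quantizer.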

Fixed-point feedforward deep neural network design using weights +1, 0, and −1

The designed fixed-point networks with ternary weights (+1, 0, and -1) and 3-bit signals show only negligible performance loss compared to their floating-point counterparts.
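Threshold-based ternarization can be sketched as follows; the fixed threshold here is a common heuristic for illustration (in practice it is often derived from the weight statistics), not necessarily the exact procedure of the paper.

```python
import numpy as np

def ternarize(w, delta):
    # Map each weight to {-1, 0, +1}: values within [-delta, delta]
    # become 0, the rest keep only their sign.
    return np.sign(w) * (np.abs(w) > delta)

w = np.array([-0.9, -0.05, 0.0, 0.1, 0.7])
wt = ternarize(w, delta=0.2)   # -> [-1, 0, 0, 0, 1]
```

With ternary weights, multiplications in the forward pass degenerate to additions, subtractions and skips, which is what makes such networks attractive for fixed-point hardware.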

Compressing Neural Networks with the Hashing Trick

This work presents a novel network architecture, HashedNets, that exploits inherent redundancy in neural networks to achieve drastic reductions in model sizes, and demonstrates on several benchmark data sets that HashedNets shrink the storage requirements of neural networks substantially while mostly preserving generalization performance.
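The hashing trick can be sketched as below: every entry of a large "virtual" weight matrix is looked up in a small shared parameter vector via a hash of its indices. The toy hash function, sizes and sign trick here are placeholders for illustration, not the paper's exact construction.

```python
import numpy as np

params = np.array([0.3, -1.2, 0.8, 0.05, -0.4])   # the only stored weights
K = len(params)

def index_hash(i, j, mod, seed=0):
    # Toy deterministic hash standing in for a real hash function.
    return (i * 92821 + j * 68917 + seed) % mod

def virtual_weight(i, j):
    # Entry (i, j) of the virtual matrix shares one of the K real
    # parameters; a second hash flips signs to decorrelate the shares.
    sign = 1 if index_hash(i, j, 2, seed=7) == 0 else -1
    return sign * params[index_hash(i, j, K)]

# A 4x6 virtual layer backed by only K = 5 real parameters:
W = np.array([[virtual_weight(i, j) for j in range(6)] for i in range(4)])
```

Only `params` is stored and trained; the virtual matrix is reconstructed on the fly, which is where the drastic reduction in model size comes from.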

Deep Learning with Low Precision by Half-Wave Gaussian Quantization

A half-wave Gaussian quantizer (HWGQ) is proposed for forward approximation and is shown to have an efficient implementation, exploiting the statistics of network activations and batch normalization operations, and to come much closer to the performance of full-precision networks than previously available low-precision networks.
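An illustrative forward quantizer in this spirit: negative pre-activations map to 0 (the "half-wave" part) and positive ones snap to a small set of fixed levels. The paper derives MSE-optimal levels for a unit Gaussian; the levels below are placeholders, not the paper's values.

```python
import numpy as np

LEVELS = np.array([0.0, 0.5, 1.0, 1.5, 2.0])   # placeholder quantization levels

def hwgq_forward(x):
    # Half-wave rectification followed by nearest-level quantization.
    x = np.maximum(x, 0.0)
    idx = np.argmin(np.abs(x[..., None] - LEVELS), axis=-1)
    return LEVELS[idx]

x = np.array([-0.7, 0.2, 0.6, 1.9, 3.0])
q = hwgq_forward(x)   # -> [0.0, 0.0, 0.5, 2.0, 2.0]
```

Because batch normalization makes pre-activations approximately unit-Gaussian, one fixed set of levels can be shared across layers, which is what makes the quantizer cheap to implement.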

High-Performance Neural Networks for Visual Object Classification

We present a fast, fully parameterizable GPU implementation of Convolutional Neural Network variants. Our feature extractors are neither carefully designed nor pre-wired, but rather learned in a…