Low Precision Floating Point Arithmetic for High Performance FPGA-based CNN Acceleration

  • Chen Wu, Mingyu Wang, Xinyuan Chu, Kun Wang, Lei He
  • Published 23 February 2020
  • Computer Science
  • Proceedings of the 2020 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays
Low precision data representation is important to reduce storage size and memory access for convolutional neural networks (CNNs). Yet, existing methods have two major limitations: (1) requiring re-training to maintain accuracy for deep CNNs, and (2) needing 16-bit floating point or 8-bit fixed point for a good accuracy. In this paper, we propose a low precision (8-bit) floating point (LPFP) quantization method for FPGA-based acceleration to overcome the above limitations. Without any re… 
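The core operation behind such a scheme is snapping float32 values onto an 8-bit floating-point grid. A minimal sketch follows; the exponent/mantissa split and bias used here (1 sign, 4 exponent, 3 mantissa bits) are illustrative assumptions, not necessarily the format the paper adopts:

```python
import numpy as np

def quantize_fp8(x, exp_bits=4, man_bits=3, bias=7):
    """Round float32 values to an 8-bit float grid (1 sign bit, exp_bits,
    man_bits). Format parameters are illustrative, not the paper's choice."""
    x = np.asarray(x, dtype=np.float32)
    sign = np.sign(x)
    mag = np.abs(x)
    # Per-value exponent, clamped to the representable range
    # (zeros are masked out before the log to avoid -inf).
    e = np.floor(np.log2(np.where(mag > 0, mag, 1.0)))
    e = np.clip(e, 1 - bias, (1 << exp_bits) - 2 - bias)
    # Snap the mantissa to man_bits fractional bits at that exponent.
    scale = 2.0 ** (e - man_bits)
    q = sign * np.round(mag / scale) * scale
    # Saturate at the largest representable magnitude.
    max_val = (2.0 - 2.0 ** -man_bits) * 2.0 ** ((1 << exp_bits) - 2 - bias)
    return np.clip(q, -max_val, max_val)
```

Values already on the grid pass through unchanged, while out-of-range values saturate rather than overflow, which is the usual choice for inference-time quantization.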
MP-OPU: A Mixed Precision FPGA-based Overlay Processor for Convolutional Neural Networks
This paper proposes a Mixed Precision FPGA-based Overlay Processor (MP-OPU) to fully leverage the advantages of mixed precision for both conventional and lightweight CNNs.
Reduced-Precision Acceleration of Radio-Astronomical Imaging on Reconfigurable Hardware
This paper presents a reduced-precision implementation of the gridding component of the widely used WSClean imaging application and proposes the first custom floating-point accelerator on a Xilinx Alveo U50 FPGA using High-Level Synthesis.
Efficient Design of Low Bitwidth Convolutional Neural Networks on FPGA with Optimized Dot Product Units
Designing hardware accelerators to run the inference of convolutional neural networks (CNNs) is under intensive research, and several different architectures have been proposed.
High-Performance FPGA-Based CNN Accelerator With Block-Floating-Point Arithmetic
An optimized block-floating-point (BFP) arithmetic is adopted in the accelerator for efficient inference of deep neural networks, improving energy and hardware efficiency by three times.
Computation Error Analysis of Block Floating Point Arithmetic Oriented Convolution Neural Network Accelerator Design
The effects of word-width choices in BFP on CNN performance without retraining are verified, and a noise-to-signal ratio (NSR) upper bound is developed, providing promising guidance for BFP-based CNN engine design.
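The block-floating-point scheme the two entries above build on shares one exponent across a block of values and stores fixed-point mantissas. A generic sketch, assuming a signed mantissa width and round-to-nearest (block size and bit-widths here are illustrative, not any specific accelerator's format):

```python
import numpy as np

def bfp_quantize(block, man_bits=8):
    """Quantize a block of values to block floating point: one shared
    exponent per block, signed fixed-point mantissas of man_bits bits.
    A generic sketch of the idea, not a specific accelerator's format."""
    block = np.asarray(block, dtype=np.float32)
    max_mag = float(np.max(np.abs(block)))
    if max_mag == 0.0:
        return block.copy()
    # Shared exponent chosen so the largest value fits in man_bits signed bits.
    shared_exp = int(np.floor(np.log2(max_mag))) + 1 - (man_bits - 1)
    scale = 2.0 ** shared_exp
    # Integer mantissas, clipped to the signed range.
    mant = np.clip(np.round(block / scale),
                   -(1 << (man_bits - 1)), (1 << (man_bits - 1)) - 1)
    return mant * scale
```

Because the exponent is shared, small values in a block with one large outlier lose precision, which is exactly the error source the NSR analysis above bounds.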
Fixed Point Implementation of Tiny-Yolo-v2 using OpenCL on FPGA
This study proposes a fixed-point (16-bit) implementation of the CNN-based object detection model Tiny-Yolo-v2 on a Cyclone V PCIe Development Kit FPGA board using the High-Level Synthesis (HLS) tool OpenCL, achieving a peak performance of 21 GOPS at a 100 MHz working frequency.
Exploration of Low Numeric Precision Deep Learning Inference Using Intel® FPGAs
A hardware design for FPGAs that exploits the bandwidth, memory, power, and computation savings of limited numerical precision is presented, along with insights into the throughput-accuracy trade-offs for various networks.
Scalable high-performance architecture for convolutional ternary neural networks on FPGA
This work presents a highly versatile, FPGA-friendly architecture for ternary neural networks (TNNs) in which both the number of input bits and the level of parallelism can be varied at synthesis time, allowing throughput to be traded for hardware resources and power consumption.
Automated systolic array architecture synthesis for high throughput CNN inference on FPGAs
This paper implements CNNs on an FPGA using a systolic array architecture, which can achieve high clock frequency under high resource utilization; it provides an analytical model for performance and resource utilization and develops an automatic design space exploration framework.
Optimizing FPGA-based Accelerator Design for Deep Convolutional Neural Networks
This work implements a CNN accelerator on a VC707 FPGA board and compares it to previous approaches, achieving a peak performance of 61.62 GFLOPS at a 100 MHz working frequency, significantly outperforming prior work.
Deep Convolutional Neural Network Inference with Floating-point Weights and Fixed-point Activations
It is shown that using floating-point numbers for weights is more efficient than fixed-point representation for the same bit-width and enables compact hardware multiply-and-accumulate (MAC) unit design.
Angel-Eye: A Complete Design Flow for Mapping CNN Onto Embedded FPGA
This paper proposes Angel-Eye, a programmable and flexible CNN accelerator architecture, together with a data quantization strategy and compilation tool; it achieves similar performance to, and better energy efficiency than, peer FPGA implementations on the same platform.
Evaluating Fast Algorithms for Convolutional Neural Networks on FPGAs
This paper proposes a novel architecture for implementing the Winograd algorithm on FPGAs, together with an analytical model that predicts resource usage and reasons about performance, and uses the model to guide a fast design space exploration.
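The Winograd fast convolution underlying such designs trades multiplications for additions. A minimal sketch of the standard 1-D F(2,3) transform (Lavin-Gray formulation), which computes two outputs of a 3-tap convolution with 4 multiplies instead of 6; the paper's actual FPGA architecture is of course far more involved:

```python
import numpy as np

# Standard Winograd F(2,3) transform matrices.
BT = np.array([[1,  0, -1,  0],
               [0,  1,  1,  0],
               [0, -1,  1,  0],
               [0,  1,  0, -1]], dtype=np.float64)
G = np.array([[1.0,  0.0, 0.0],
              [0.5,  0.5, 0.5],
              [0.5, -0.5, 0.5],
              [0.0,  0.0, 1.0]], dtype=np.float64)
AT = np.array([[1, 1,  1,  0],
               [0, 1, -1, -1]], dtype=np.float64)

def winograd_f23(d, g):
    """Two outputs of a 1-D 3-tap convolution (correlation) on a 4-element
    input tile, using 4 elementwise multiplies instead of 6."""
    U = G @ g            # transformed filter (precomputable per layer)
    V = BT @ d           # transformed input tile
    return AT @ (U * V)  # elementwise product, then inverse transform
```

In an accelerator, `G @ g` is computed once offline per filter, so only the input and inverse transforms (additions/subtractions) plus the 4 multiplies run per tile.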