Corpus ID: 140095853

Full-stack Optimization for Accelerating CNNs with FPGA Validation

Bradley McDanel, Sai Qian Zhang, H. T. Kung, Xin Dong

We present a full-stack optimization framework for accelerating inference of CNNs (Convolutional Neural Networks) and validate the approach with field-programmable gate arrays (FPGA) implementations. By jointly optimizing CNN models, computing architectures, and hardware implementations, our full-stack approach achieves unprecedented performance in the trade-off space characterized by inference latency, energy efficiency, hardware utilization and inference accuracy. As a validation vehicle, we… 

Toward Full-Stack Acceleration of Deep Convolutional Neural Networks on FPGAs

This article introduces a highly customized streaming hardware architecture that improves compute efficiency for streaming applications by providing full-stack acceleration of CNNs on FPGAs, and demonstrates performance that exceeds that of state-of-the-art FPGA accelerators.

Maximizing CNN accelerator efficiency through resource partitioning

This work presents a new CNN accelerator paradigm and an accompanying automated design methodology that partitions the available FPGA resources into multiple processors, each of which is tailored for a different subset of the CNN convolutional layers.

Optimizing Loop Operation and Dataflow in FPGA Acceleration of Deep Convolutional Neural Networks

This work systematically explores the trade-offs of hardware cost by searching the design variable configurations, and proposes a specific dataflow for hardware CNN acceleration that minimizes memory access and data movement while maximizing resource utilization to achieve high performance.

Exploring heterogeneous algorithms for accelerating deep convolutional neural networks on FPGAs

This paper proposes a fusion architecture that can fuse multiple layers naturally in CNNs, reusing the intermediate data, and designs an optimal algorithm to determine the fusion and algorithm strategy for each layer.

Going Deeper with Embedded FPGA Platform for Convolutional Neural Network

This paper presents an in-depth analysis of state-of-the-art CNN models, shows that convolutional layers are computation-centric while fully connected layers are memory-centric, and proposes a CNN accelerator design on an embedded FPGA for ImageNet large-scale image classification.

Design Flow of Accelerating Hybrid Extremely Low Bit-Width Neural Network in Embedded FPGA

This work proposes a design flow for accelerating the extremely low bit-width neural network (ELB-NN) in embedded FPGAs with hybrid quantization schemes, which facilitates the design space exploration and simplifies the tradeoff between network accuracy and computation efficiency.

Fused-layer CNN accelerators

This work finds that a previously unexplored dimension exists in the design space of CNN accelerators that focuses on the dataflow across convolutional layers, and is able to fuse the processing of multiple CNN layers by modifying the order in which the input data are brought on chip, enabling caching of intermediate data between the evaluation of adjacent CNN layers.

ShiDianNao: Shifting vision processing closer to the sensor

This paper proposes an accelerator, designed down to the layout at 65 nm, that is 60x more energy efficient than the previous state-of-the-art neural network accelerator; it has a modest footprint and consumes only 320 mW, yet is still about 30x faster than high-end GPUs.

Quantizing deep convolutional networks for efficient inference: A whitepaper

This whitepaper presents an overview of techniques for quantizing convolutional neural networks for inference with integer weights and activations, and recommends per-channel quantization of weights and per-layer quantization of activations as the preferred quantization scheme for hardware acceleration and kernel optimization.
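
To make the per-channel versus per-layer distinction concrete, here is a minimal pure-Python sketch. It assumes symmetric, signed 8-bit quantization with a max-abs scale, which is only one possible scheme and is not taken from the whitepaper; the function names and the toy weights are hypothetical.

```python
def quantize_symmetric(values, num_bits=8, scale=None):
    """Quantize floats to signed integers; returns (ints, scale).

    Assumption: symmetric range [-qmax, qmax] with a max-abs scale
    unless an explicit scale is passed in.
    """
    qmax = 2 ** (num_bits - 1) - 1  # 127 for 8 bits
    if scale is None:
        max_abs = max(abs(v) for v in values)
        scale = max_abs / qmax if max_abs > 0 else 1.0
    ints = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return ints, scale


def quantize_per_channel(weights, num_bits=8):
    """One scale per output channel (each channel is a flat list)."""
    return [quantize_symmetric(ch, num_bits) for ch in weights]


def quantize_per_layer(weights, num_bits=8):
    """A single scale shared by every channel in the layer."""
    flat = [v for ch in weights for v in ch]
    _, scale = quantize_symmetric(flat, num_bits)
    return [quantize_symmetric(ch, num_bits, scale) for ch in weights]


def channel_errors(weights, quantized):
    """Largest absolute reconstruction error in each channel."""
    return [
        max(abs(v - q * s) for v, q in zip(ch, ints))
        for ch, (ints, s) in zip(weights, quantized)
    ]


# Toy layer: channel 1 has much smaller weights than channel 0.
weights = [[0.5, -1.0, 0.25], [0.01, -0.02, 0.005]]
per_channel_err = channel_errors(weights, quantize_per_channel(weights))
per_layer_err = channel_errors(weights, quantize_per_layer(weights))
```

In this toy layer the shared per-layer scale is set by the large weights in channel 0, so the small weights in channel 1 are rounded much more coarsely than they are with their own per-channel scale.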

Packing Sparse Convolutional Neural Networks for Efficient Systolic Array Implementations: Column Combining Under Joint Optimization

This paper describes a novel approach of packing sparse convolutional neural networks into a denser format for efficient implementation on systolic arrays, and demonstrates that, to mitigate data privacy concerns, the retraining can be accomplished with only a fraction of the original dataset.

Bi-Real Net: Enhancing the Performance of 1-bit CNNs With Improved Representational Capability and Advanced Training Algorithm

This paper proposes a novel model, dubbed Bi-Real net, which connects the real activations (after the 1-bit convolution and/or BatchNorm layer, before the sign function) to the activations of the consecutive block through an identity shortcut, and which achieves up to 10% higher top-1 accuracy with more memory saving and lower computational cost.
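
The shortcut described above can be illustrated with a rough 1D sketch in pure Python. It is a simplification under stated assumptions, not the paper's implementation: BatchNorm is omitted, the kernel length is assumed odd, zero padding keeps the output length equal to the input length, and all function names are hypothetical.

```python
def sign(x):
    """Binarize to +1 / -1 (zero maps to +1 in this sketch)."""
    return 1 if x >= 0 else -1


def binary_conv1d(activations, weights):
    """1-bit 1D convolution: both inputs are binarized with sign().

    Assumptions: stride 1, odd kernel length, zero padding so the
    output has the same length as the input.
    """
    k = len(weights)
    pad = k // 2
    a = [sign(v) for v in activations]
    w = [sign(v) for v in weights]
    padded = [0] * pad + a + [0] * pad
    return [
        sum(padded[i + j] * w[j] for j in range(k))
        for i in range(len(a))
    ]


def bi_real_block(real_in, weights):
    """Add the real-valued input back onto the 1-bit convolution output.

    The identity shortcut carries the real activations (taken before the
    sign function) forward, so the next block still receives real values.
    """
    conv_out = binary_conv1d(real_in, weights)
    return [c + r for c, r in zip(conv_out, real_in)]


real_in = [0.3, -1.2, 0.7]
block_out = bi_real_block(real_in, [1.0, -0.5, 2.0])
```

Without the shortcut, the block output would take only a few integer values; adding `real_in` preserves the magnitude information that the sign function discards.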