Optimizing Loop Operation and Dataflow in FPGA Acceleration of Deep Convolutional Neural Networks

@inproceedings{Ma2017OptimizingLO,
  title={Optimizing Loop Operation and Dataflow in FPGA Acceleration of Deep Convolutional Neural Networks},
  author={Yufei Ma and Yu Cao and S. Vrudhula and Jae-sun Seo},
  booktitle={Proceedings of the 2017 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays},
  year={2017}
}
As convolution layers contribute most of the operations in convolutional neural network (CNN) algorithms, an effective convolution acceleration scheme significantly affects the efficiency and performance of a hardware CNN accelerator. [...] The proposed CNN acceleration scheme and architecture are demonstrated on a standalone Altera Arria 10 GX 1150 FPGA by implementing the end-to-end VGG-16 CNN model, achieving 645.25 GOPS of throughput and 47.97 ms of latency, a >3.2× enhancement over the state of the art.
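The convolution loops that the paper's acceleration scheme reorders and tiles can be sketched in plain Python (a minimal illustration only; the tile sizes `Tm`, `Tc` and the function names are hypothetical, not taken from the paper, and real designs also tile the spatial and kernel loops):

```python
import numpy as np

def conv_direct(ifm, weights):
    """Direct convolution: ifm (C,H,W), weights (M,C,K,K) -> ofm (M,H-K+1,W-K+1)."""
    M, C, K, _ = weights.shape
    _, H, W = ifm.shape
    ofm = np.zeros((M, H - K + 1, W - K + 1))
    for m in range(M):                      # output feature maps
        for c in range(C):                  # input feature maps
            for y in range(H - K + 1):      # output rows
                for x in range(W - K + 1):  # output columns
                    for ky in range(K):     # kernel rows
                        for kx in range(K): # kernel columns
                            ofm[m, y, x] += weights[m, c, ky, kx] * ifm[c, y + ky, x + kx]
    return ofm

def conv_tiled(ifm, weights, Tm=2, Tc=2):
    """Same arithmetic with the M and C loops tiled (hypothetical tile sizes Tm, Tc)."""
    M, C, K, _ = weights.shape
    _, H, W = ifm.shape
    ofm = np.zeros((M, H - K + 1, W - K + 1))
    for m0 in range(0, M, Tm):        # outer tile loops pick which block of output /
        for c0 in range(0, C, Tc):    # input channels would live in on-chip buffers
            for m in range(m0, min(m0 + Tm, M)):
                for c in range(c0, min(c0 + Tc, C)):
                    for y in range(H - K + 1):
                        for x in range(W - K + 1):
                            for ky in range(K):
                                for kx in range(K):
                                    ofm[m, y, x] += weights[m, c, ky, kx] * ifm[c, y + ky, x + kx]
    return ofm
```

Tiling does not change the arithmetic, only which data must be resident at once; on an FPGA that choice determines on-chip buffer sizes and off-chip memory traffic.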
Citations

Optimizing the Convolution Operation to Accelerate Deep Neural Networks on FPGA
This paper quantitatively analyzes and optimizes the design objectives of a CNN accelerator over multiple design variables, and proposes a specific dataflow for hardware CNN acceleration that minimizes data communication while maximizing resource utilization to achieve high performance.
Optimizing of Convolutional Neural Network Accelerator
Two optimization methods for a CNN accelerator, reduced data precision and data reuse, are described; they improve accelerator performance under a limited on-chip buffer and effectively reduce power consumption.
Designing efficient accelerator of depthwise separable convolutional neural network on FPGA
A Field-Programmable Gate Array (FPGA)-based depthwise separable CNN accelerator is presented in which all layers work concurrently in a pipelined fashion to improve system throughput and performance, together with a custom computing engine architecture that handles the dataflow between adjacent layers through double-buffering-based memory channels.
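The double-buffering memory channel described above can be mimicked sequentially in Python (illustrative names only; in hardware the prefetch of the next tile and the compute of the current one proceed in parallel):

```python
def pipeline_layers(tiles, load, compute):
    """Ping-pong buffering: while tile i is computed from one buffer,
    tile i+1 is loaded into the other (sequential stand-in for parallel hardware)."""
    buffers = [None, None]
    results = []
    buffers[0] = load(tiles[0])                    # prime the first buffer
    for i in range(len(tiles)):
        if i + 1 < len(tiles):
            buffers[(i + 1) % 2] = load(tiles[i + 1])  # prefetch into the idle buffer
        results.append(compute(buffers[i % 2]))        # compute from the full buffer
    return results
```

With two buffers, the load of tile i+1 is hidden behind the compute of tile i, so external memory latency stops gating the compute engine.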
An Efficient Hardware Accelerator for Sparse Convolutional Neural Networks on FPGAs
An FPGA accelerator for sparse CNNs is developed that achieves 223.4-309.0 GOP/s for modern CNNs on a Xilinx ZCU102, a 3.6x-12.9x speedup over previous dense CNN FPGA accelerators.
An Efficient Hardware Accelerator for Structured Sparse Convolutional Neural Networks on FPGAs
A sparsewise dataflow is proposed that skips the cycles spent processing multiply-and-accumulate operations (MACs) with zero weights and exploits data statistics to minimize energy through zero gating, avoiding unnecessary computations.
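The zero-skipping idea behind such a sparsewise dataflow reduces, in software terms, to guarding each MAC on the weight value (a toy scalar sketch; hardware skips whole cycles rather than branching):

```python
def sparse_dot(weights, activations):
    """Dot product that skips MACs whose weight is zero,
    mirroring zero-gating in a sparse accelerator."""
    acc = 0
    macs = 0                # count of MACs actually performed
    for w, a in zip(weights, activations):
        if w != 0:          # cycles for zero weights are skipped entirely
            acc += w * a
            macs += 1
    return acc, macs
```

The returned MAC count makes the saving explicit: a 75%-sparse weight vector performs a quarter of the dense work for the same result.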
An Efficient FPGA Accelerator Design for Optimized CNNs Using OpenCL
This work discusses a mixed-precision approach that counters the limited memory bandwidth of the CNN model and achieves more than 1.9× higher energy efficiency compared to an embedded Nvidia Jetson TX1 implementation of VGG-16.
DSP-Efficient Hardware Acceleration of Convolutional Neural Network Inference on FPGAs
This work proposes a transformation of the convolution computation that reshapes the accelerator design space and relaxes the pressure on the required DSP resources, striking a favorable balance between utilization of the FPGA's on-chip memory, logic, and DSP resources.
Software-Defined FPGA-Based Accelerator for Deep Convolutional Neural Networks: (Abstract Only)
A software-defined architecture is designed to cope with different CNN models, offering high flexibility while keeping relatively high throughput.
Instruction driven cross-layer CNN accelerator with winograd transformation on FPGA
  • J. Yu, Yiming Hu, +4 authors H. Yang
  • Computer Science
  • 2017 International Conference on Field Programmable Technology (ICFPT)
  • 2017
This work designs an instruction-driven CNN accelerator supporting the Winograd algorithm and cross-layer scheduling, and improves the on-chip memory architecture for a higher computation-unit utilization rate in Winograd mode.
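As a reference point for the Winograd transformation mentioned above, the classic F(2,3) one-dimensional case computes two convolution outputs with four multiplies instead of six (these are the standard textbook transform matrices, not this accelerator's implementation):

```python
import numpy as np

# Winograd F(2,3) transform matrices (standard minimal-filtering form)
BT = np.array([[1,  0, -1,  0],
               [0,  1,  1,  0],
               [0, -1,  1,  0],
               [0,  1,  0, -1]], dtype=float)
G = np.array([[1.0,  0.0, 0.0],
              [0.5,  0.5, 0.5],
              [0.5, -0.5, 0.5],
              [0.0,  0.0, 1.0]])
AT = np.array([[1, 1,  1,  0],
               [0, 1, -1, -1]], dtype=float)

def winograd_f23(d, g):
    """Two 1-D convolution outputs from a 4-tap input d and a 3-tap filter g,
    using 4 elementwise multiplies instead of 6."""
    U = G @ g            # filter transform
    V = BT @ d           # input transform
    return AT @ (U * V)  # elementwise product, then output transform
```

Tiling this over 2-D feature maps is what lets Winograd-based accelerators trade multiplier (DSP) count for extra additions.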
Dataflow Aware Mapping of Convolutional Neural Networks Onto Many-Core Platforms With Network-on-Chip Interconnect
An automated mapping strategy is presented that starts at the single-core level with different optimization targets for minimal runtime and minimal off-chip memory accesses, and scales from a single core up to 128 cores, thereby showing the limits of the selected approach.

References

Optimizing FPGA-based Accelerator Design for Deep Convolutional Neural Networks
A CNN accelerator is implemented on a VC707 FPGA board and compared with previous approaches, achieving a peak performance of 61.62 GFLOPS at a 100 MHz working frequency and significantly outperforming them.
Scalable and modularized RTL compilation of Convolutional Neural Networks onto FPGA
This work quantitatively analyzes the compiler's design strategy for optimizing the throughput of a given CNN model under FPGA resource constraints, and demonstrates the promise of an automatic compiler solution for modularized and scalable hardware acceleration of deep learning.
A high performance FPGA-based accelerator for large-scale convolutional neural networks
This work proposes an end-to-end FPGA-based CNN accelerator with all layers mapped on one chip, so that different layers can work concurrently in a pipelined structure to increase throughput.
Throughput-Optimized OpenCL-based FPGA Accelerator for Large-Scale Convolutional Neural Networks
This work presents a systematic design space exploration methodology to maximize the throughput of an OpenCL-based FPGA accelerator for a given CNN model, considering FPGA resource constraints such as on-chip memory, registers, computational resources, and external memory bandwidth.
Efficient FPGA acceleration of Convolutional Neural Networks using logical-3D compute array
This paper presents a flexible yet highly efficient 3D neuron array architecture that is a natural fit for convolutional layers, along with a technique to optimize its parameters, including on-chip buffer sizes, for a given set of resource constraints on modern FPGAs.
Design space exploration of FPGA-based Deep Convolutional Neural Networks
This paper proposes an FPGA-based accelerator architecture that leverages all sources of parallelism in DCNNs, and develops analytical feasibility and performance estimation models that take into account various design and platform parameters.
Going Deeper with Embedded FPGA Platform for Convolutional Neural Network
This paper presents an in-depth analysis of state-of-the-art CNN models, shows that convolutional layers are computation-centric while fully-connected layers are memory-centric, and proposes a CNN accelerator design on an embedded FPGA for ImageNet large-scale image classification.
14.5 Eyeriss: An energy-efficient reconfigurable accelerator for deep convolutional neural networks
To achieve state-of-the-art accuracy, CNNs need not only a larger number of layers but also millions of filter weights and varying shapes, which results in substantial data movement that consumes significant energy.
EIE: Efficient Inference Engine on Compressed Deep Neural Network
  • Song Han, Xingyu Liu, +4 authors W. Dally
  • Computer Science
  • 2016 ACM/IEEE 43rd Annual International Symposium on Computer Architecture (ISCA)
  • 2016
An energy-efficient inference engine (EIE) is presented that performs inference on the compressed network model and accelerates the resulting sparse matrix-vector multiplication with weight sharing; it is 189x and 13x faster than CPU and GPU implementations, respectively, of the same DNN without compression.
Eyeriss: a spatial architecture for energy-efficient dataflow for convolutional neural networks
A novel dataflow, called row-stationary (RS), is presented that minimizes data movement energy consumption on a spatial architecture, adapts to different CNN shape configurations, and reduces all types of data movement by maximally utilizing processing engine (PE) local storage, direct inter-PE communication, and spatial parallelism.
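A sequential caricature of the row-stationary idea: each "PE" keeps one filter row stationary and reuses it across an input row, with partial sums accumulated per output row (illustrative Python only; real Eyeriss PEs run in parallel with explicit local storage):

```python
def row_stationary_conv2d(ifm, filt):
    """2-D valid convolution where each 'PE' holds one filter row stationary
    and slides it over one input row; partial sums from PEs holding successive
    filter rows are accumulated into each output row."""
    K = len(filt)
    H, W = len(ifm), len(ifm[0])
    out_h, out_w = H - K + 1, W - K + 1

    def pe(filter_row, input_row):
        # 1-D convolution inside one PE: the filter row is reused
        # for every output position along the input row.
        return [sum(filter_row[k] * input_row[x + k] for k in range(K))
                for x in range(out_w)]

    out = [[0.0] * out_w for _ in range(out_h)]
    for ky in range(K):                  # PE holding filter row ky
        for y in range(out_h):
            for x, psum in enumerate(pe(filt[ky], ifm[y + ky])):
                out[y][x] += psum        # accumulate partial sums across PEs
    return out
```

The reuse pattern, not the arithmetic, is the point: each filter row is fetched once per PE and reused across the whole input row, which is the data movement RS minimizes.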