Throughput-Optimized OpenCL-based FPGA Accelerator for Large-Scale Convolutional Neural Networks

@inproceedings{Suda2016ThroughputOptimizedOF,
  title={Throughput-Optimized OpenCL-based FPGA Accelerator for Large-Scale Convolutional Neural Networks},
  author={Naveen Suda and V. Chandra and Ganesh S. Dasika and Abinash Mohanty and Yufei Ma and S. Vrudhula and Jae-sun Seo and Yu Cao},
  booktitle={Proceedings of the 2016 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays},
  year={2016}
}
Convolutional Neural Networks (CNNs) have gained popularity in many computer vision applications such as image classification, face detection, and video analysis because of their ability to train and classify with high accuracy. [...] We achieve a peak performance of 136.5 GOPS for the convolution operation, and 117.8 GOPS for the entire VGG network performing ImageNet classification on the P395-D8 board.
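As a rough illustration of how throughput figures like the GOPS numbers quoted above are computed, the sketch below counts the operations in the VGG-16 convolution layers and divides by a measured latency. The latency value is a hypothetical placeholder, and counting one multiply-accumulate as two operations is an assumption, not a detail taken from the paper.

```python
# Hedged sketch: computing a GOPS throughput figure for a CNN.
# Layer shapes are the 13 VGG-16 conv layers; one MAC = 2 ops (assumption).

def conv_ops(out_h, out_w, out_c, in_c, k):
    """Operations for one conv layer: 2 * MACs (multiply + add)."""
    return 2 * out_h * out_w * out_c * in_c * k * k

# (out_h, out_w, out_c, in_c, kernel) for the VGG-16 conv layers
vgg16_convs = [
    (224, 224, 64, 3, 3), (224, 224, 64, 64, 3),
    (112, 112, 128, 64, 3), (112, 112, 128, 128, 3),
    (56, 56, 256, 128, 3), (56, 56, 256, 256, 3), (56, 56, 256, 256, 3),
    (28, 28, 512, 256, 3), (28, 28, 512, 512, 3), (28, 28, 512, 512, 3),
    (14, 14, 512, 512, 3), (14, 14, 512, 512, 3), (14, 14, 512, 512, 3),
]

total_gop = sum(conv_ops(*layer) for layer in vgg16_convs) / 1e9
latency_s = 0.262  # hypothetical measured latency per image, not from the paper
print(f"{total_gop:.1f} GOP total, {total_gop / latency_s:.1f} GOPS")
```

The convolution layers alone come to roughly 30.7 GOP per image, which is why sub-second per-image latencies translate into the 100+ GOPS figures reported for VGG-class networks.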
Citations

PipeCNN: An OpenCL-Based FPGA Accelerator for Large-Scale Convolution Neuron Networks
An FPGA accelerator with a new architecture of deeply pipelined OpenCL kernels, which can be reused to explore new architectures for neural network accelerators; it achieves a similar peak performance of 33.9 GOPS with a 34% resource reduction in DSP blocks.
Throughput-Optimized FPGA Accelerator for Deep Convolutional Neural Networks
Proposes a scalable parallel framework that exploits four levels of parallelism in hardware acceleration, together with a systematic design-space exploration methodology to find the solution that maximizes accelerator throughput under FPGA resource constraints.
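The design-space exploration described in the entry above can be sketched as a small search loop: enumerate candidate parallelism factors, discard configurations that exceed a resource budget, and keep the one with the highest modeled throughput. The cost and performance models below are illustrative assumptions, not the paper's actual models.

```python
# Hedged sketch of throughput-maximizing design-space exploration
# under an FPGA resource budget. All constants are hypothetical.
from itertools import product

DSP_BUDGET = 2800   # hypothetical number of DSP blocks on the device
FREQ_MHZ = 200      # hypothetical clock frequency
DSPS_PER_MAC = 1    # assumption: one DSP per parallel MAC unit

def dsp_cost(p_in, p_out):
    # DSPs consumed by p_in x p_out parallel MAC units
    return p_in * p_out * DSPS_PER_MAC

def throughput_gops(p_in, p_out):
    # 2 ops (mul + add) per MAC per cycle, fully pipelined (assumption)
    return 2 * p_in * p_out * FREQ_MHZ / 1e3

# Enumerate unroll factors, keep feasible designs, pick the fastest.
best = max(
    ((p_in, p_out)
     for p_in, p_out in product([1, 2, 4, 8, 16, 32, 64], repeat=2)
     if dsp_cost(p_in, p_out) <= DSP_BUDGET),
    key=lambda cfg: throughput_gops(*cfg),
)
print(best, throughput_gops(*best), "GOPS")
```

Real flows replace the brute-force loop and toy models with analytical or simulation-based estimates of DSP, BRAM, and bandwidth usage, but the structure of the search is the same.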
Convolutional Neural Networks Accelerators on FPGA Integrated With Hybrid Memory Cube
Presents two new FPGA designs that use the Hybrid Memory Cube (HMC) as external memory to efficiently accelerate CNNs: a 32-bit fixed-point design named Memory Conscious CNN Accelerator and a low-power DNN accelerator in which data layout is performed before the pipeline execution (PE).
A Scalable FPGA Accelerator for Convolutional Neural Networks
Proposes an FPGA accelerator with a scalable architecture of deeply pipelined OpenCL kernels, achieving a peak performance of 141 GOPS for the convolution operation and 103 GOPS for the entire VGG-16 network performing ImageNet classification on a DE5-Net board.
A Parametrizable High-Level Synthesis Library for Accelerating Neural Networks on FPGAs
A High-Level Synthesis (HLS) library for CNN algorithms containing seven streaming-capable CNN functions (plus two conversion functions) for building large neural networks with deep pipelines; it is integrated into HiFlipVX, an open-source HLS FPGA library for image processing and object detection.
A high performance FPGA-based accelerator for large-scale convolutional neural networks
Proposes an end-to-end FPGA-based CNN accelerator with all layers mapped onto one chip, so that different layers can work concurrently in a pipelined structure to increase throughput.
Toward Multi-FPGA Acceleration of the Neural Networks
A generic multi-FPGA solution, written in OpenCL, that can accelerate more complex CNNs, achieves near-linear speedup with respect to the available single-FPGA solutions, and can outperform other FPGA 2D accelerators by up to 8.4 times.
An Efficient Design Flow for Accelerating Complicated-connected CNNs on a Multi-FPGA Platform
Proposes a complete design flow for accelerating the inference of complicated-connected CNNs on a multi-FPGA platform, with DAG abstraction, mapping-scheme generation, and design-space exploration to efficiently support the flow.
Improving the Performance of OpenCL-based FPGA Accelerator for Convolutional Neural Network
Proposes an analytical performance model and an in-depth analysis of the resource requirements of CNN classifier kernels versus the resources available on modern FPGAs, together with a new kernel design that effectively addresses the memory-bandwidth limitation and provides an optimal balance between computation, on-chip, and off-chip memory access.
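Analytical performance models of the kind mentioned above are often roofline-style: attainable throughput is capped either by the compute roof or by off-chip bandwidth times arithmetic intensity. The sketch below illustrates that idea with hypothetical numbers; it is not the model from that paper.

```python
# Hedged roofline-style sketch: attainable throughput is the minimum of
# the compute roof and bandwidth * arithmetic intensity. Numbers are
# illustrative assumptions only.

PEAK_GOPS = 800.0   # hypothetical compute roof of the accelerator
BW_GBPS = 12.8      # hypothetical off-chip memory bandwidth

def attainable_gops(ops_per_byte):
    """Roofline: min(compute roof, bandwidth * arithmetic intensity)."""
    return min(PEAK_GOPS, BW_GBPS * ops_per_byte)

# A kernel needs PEAK_GOPS / BW_GBPS ops/byte to become compute-bound.
ridge = PEAK_GOPS / BW_GBPS          # 62.5 ops/byte with these numbers
assert attainable_gops(ridge / 2) == PEAK_GOPS / 2   # bandwidth-bound
assert attainable_gops(ridge * 2) == PEAK_GOPS       # compute-bound
```

Kernel designs that raise arithmetic intensity (e.g., by reusing on-chip data) move a bandwidth-bound kernel toward the compute roof, which is the balance such models are used to find.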
A Novel FPGA Accelerator Design for Real-Time and Ultra-Low Power Deep Convolutional Neural Networks Compared With Titan X GPU
Analyzes in detail the data dependencies in the CNN accelerator and proposes specific pipelined operations and a data-organization scheme to build a high-throughput CNN accelerator on an FPGA, with kernel operations optimized for high power efficiency.

References

Showing 1–10 of 29 references
Optimizing FPGA-based Accelerator Design for Deep Convolutional Neural Networks
Implements a CNN accelerator on a VC707 FPGA board that achieves a peak performance of 61.62 GFLOPS at a 100 MHz working frequency, significantly outperforming previous approaches.
Memory-centric accelerator design for Convolutional Neural Networks
Shows that the effects of the memory bottleneck can be reduced by a flexible memory hierarchy that supports the complex data-access patterns in CNN workloads and minimizes on-chip memory size, reducing area and energy usage.
DaDianNao: A Machine-Learning Supercomputer
  • Yunji Chen, Tao Luo, +8 authors O. Temam
  • Computer Science
  • 2014 47th Annual IEEE/ACM International Symposium on Microarchitecture
  • 2014
Introduces a custom multi-chip machine-learning architecture, showing that, on a subset of the largest known neural network layers, it is possible to achieve a speedup of 450.65x over a GPU and reduce energy by 150.31x on average for a 64-chip system.
A 240 G-ops/s Mobile Coprocessor for Deep Neural Networks
Presents the nn-X system, a scalable, low-power coprocessor for real-time execution of deep neural networks, achieving a peak performance of 227 G-ops/s and a performance-per-power improvement of 10 to 100 times over conventional mobile and desktop processors.
High Performance Convolutional Neural Networks for Document Processing
Presents three novel approaches to speeding up CNNs: (a) unrolling convolution, (b) using BLAS (basic linear algebra subroutines), and (c) using GPUs (graphics processing units).
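The "unrolling convolution" technique named above (often called im2col) flattens each sliding window of the input into a row, so the whole convolution becomes one matrix multiplication that a BLAS GEMM routine or GPU can execute. The pure-Python sketch below shows the single-channel, no-padding case and checks it against direct convolution; the function names are illustrative, not from the paper.

```python
# Hedged sketch of unrolled (im2col) convolution: convolution as one
# matrix-vector product over flattened input patches. Single channel,
# stride 1, no padding.

def conv2d_direct(img, ker):
    """Reference: direct sliding-window convolution (cross-correlation)."""
    kh, kw = len(ker), len(ker[0])
    oh, ow = len(img) - kh + 1, len(img[0]) - kw + 1
    return [[sum(img[i + di][j + dj] * ker[di][dj]
                 for di in range(kh) for dj in range(kw))
             for j in range(ow)] for i in range(oh)]

def im2col(img, kh, kw):
    """Flatten each kh x kw window into one row of a patch matrix."""
    oh, ow = len(img) - kh + 1, len(img[0]) - kw + 1
    return [[img[i + di][j + dj] for di in range(kh) for dj in range(kw)]
            for i in range(oh) for j in range(ow)]

def conv2d_gemm(img, ker):
    """Convolution via the unrolled patch matrix (GEMM-style)."""
    kh, kw = len(ker), len(ker[0])
    cols = im2col(img, kh, kw)                     # (oh*ow) x (kh*kw)
    w = [ker[di][dj] for di in range(kh) for dj in range(kw)]
    flat = [sum(x * k for x, k in zip(row, w)) for row in cols]
    ow = len(img[0]) - kw + 1
    return [flat[i:i + ow] for i in range(0, len(flat), ow)]

img = [[1, 2, 3, 0], [4, 5, 6, 1], [7, 8, 9, 2], [3, 2, 1, 0]]
ker = [[1, 0], [0, -1]]
assert conv2d_direct(img, ker) == conv2d_gemm(img, ker)
```

With multiple channels and filters the weight vector becomes a matrix, and the patch-matrix product is exactly the dense GEMM that BLAS libraries and GPUs are optimized for, at the cost of duplicating overlapping input pixels in memory.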
A dynamically configurable coprocessor for convolutional neural networks
The first CNN architecture to achieve real-time video-stream processing (25 to 30 frames per second) on a wide range of object detection and recognition tasks.
Gzip on a chip: high performance lossless data compression on FPGAs using OpenCL
Uses the Open Computing Language (OpenCL) to implement high-speed data compression (Gzip) on a field-programmable gate array (FPGA), achieving a throughput of 3 GB/s with more than a 2x compression ratio on standard compression benchmarks.
Caffe: Convolutional Architecture for Fast Feature Embedding
Caffe provides multimedia scientists and practitioners with a clean, modifiable framework for state-of-the-art deep learning algorithms and a collection of reference models for training and deploying general-purpose convolutional neural networks and other deep models efficiently on commodity architectures.
Hardware accelerated convolutional neural networks for synthetic vision systems
A fully digital, modular vision engine with the goal of performing real-time detection, recognition, and segmentation of megapixel images.
ImageNet classification with deep convolutional neural networks
A large, deep convolutional neural network trained to classify the 1.2 million high-resolution images of the ImageNet LSVRC-2010 contest into 1000 classes, employing a recently developed regularization method called "dropout" that proved very effective.