Scalable and modularized RTL compilation of Convolutional Neural Networks onto FPGA
@article{Ma2016ScalableAM,
  title   = {Scalable and modularized RTL compilation of Convolutional Neural Networks onto FPGA},
  author  = {Yufei Ma and Naveen Suda and Yu Cao and Jae-sun Seo and Sarma B. K. Vrudhula},
  journal = {2016 26th International Conference on Field Programmable Logic and Applications (FPL)},
  year    = {2016},
  pages   = {1-8}
}
Despite its popularity, deploying Convolutional Neural Networks (CNNs) on a portable system is still challenging due to large data volume, intensive computation and frequent memory access. Although previous FPGA acceleration schemes generated by high-level synthesis tools (e.g., HLS, OpenCL) have allowed for fast design optimization, hardware inefficiency still exists when allocating FPGA resources to maximize parallelism and throughput. A direct hardware-level design (i.e., RTL) can improve…
133 Citations
ALAMO: FPGA acceleration of deep learning algorithms with a modularized RTL compiler
- Computer Science, Integr.
- 2018
An automatic RTL compiler for high-throughput FPGA implementation of diverse deep convolutional neural networks
- Computer Science, 2017 27th International Conference on Field Programmable Logic and Applications (FPL)
- 2017
This work presents an RTL-level CNN compiler that automatically generates customized FPGA hardware for the inference tasks of various CNNs, to enable fast high-level prototyping of CNNs from software to FPGAs while keeping the benefits of low-level hardware optimization.
Toolflows for Mapping Convolutional Neural Networks on FPGAs
- Computer Science, ACM Comput. Surv.
- 2018
A survey of the existing CNN-to-FPGA toolflows is presented, comprising a comparative study of their key characteristics, which include the supported applications, architectural choices, design space exploration methods, and achieved performance.
Fast generation of high throughput customized deep learning accelerators on FPGAs
- Computer Science, 2017 International Conference on ReConFigurable Computing and FPGAs (ReConFig)
- 2017
An automatic code generation tool that synthesizes high-throughput accelerators for CNN inference, targeting a broad range of CNNs and FPGAs, and adopts an algorithm-architecture co-design methodology based on frequency-domain convolution.
CNN2Gate: Toward Designing a General Framework for Implementation of Convolutional Neural Networks on FPGA
- Computer Science, arXiv
- 2020
An integrated framework (CNN2Gate) is introduced that supports compilation of a CNN model for an FPGA target, performs design-space exploration using a reinforcement-learning agent, and automatically fits the design onto different FPGAs with limited logic resources.
Evaluating Fast Algorithms for Convolutional Neural Networks on FPGAs
- Computer Science, 2017 IEEE 25th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM)
- 2017
This paper proposes a novel architecture for implementing the Winograd algorithm on FPGAs, along with an analytical model to predict resource usage and reason about performance; the model is used to guide a fast design space exploration.
Optimising Convolutional Neural Networks for Reconfigurable Acceleration
- Computer Science
- 2017
A CNN model transpiler framework, Plumber, that can directly transform high-level models to FPGA designs with a novel model-hardware co-optimisation module is presented.
VHDL auto-generation tool for optimized hardware acceleration of convolutional neural networks on FPGA (VGT)
- Computer Science
- 2018
A VHDL generation tool (VGT) through which VHDL code (the CNN architecture) can be generated on the fly for different CNN models (benchmarked and hand-tuned); the generated hardware is modular, highly parallel, reconfigurable, scalable, fully pipelined, and adaptive to different CNN models.
Optimizing Frequency Domain Implementation of CNNs on FPGAs
- Computer Science
- 2017
An algorithm-architecture co-design methodology based on the computational characteristics of CNN models and the features of the underlying hardware to realize high-performance designs that speed up various CNN models.
CaFPGA: An automatic generation model for CNN accelerator
- Computer Science, Microprocess. Microsystems
- 2018
27 References
Throughput-Optimized OpenCL-based FPGA Accelerator for Large-Scale Convolutional Neural Networks
- Computer Science, FPGA
- 2016
This work presents a systematic design space exploration methodology to maximize the throughput of an OpenCL-based FPGA accelerator for a given CNN model, considering the FPGA's resource constraints such as on-chip memory, registers, computational resources, and external memory bandwidth.
Optimizing FPGA-based Accelerator Design for Deep Convolutional Neural Networks
- Computer Science, FPGA
- 2015
This work implements a CNN accelerator on a VC707 FPGA board and compares it to previous approaches, achieving a peak performance of 61.62 GFLOPS at a 100 MHz working frequency, which significantly outperforms previous approaches.
Going Deeper with Embedded FPGA Platform for Convolutional Neural Network
- Computer Science, FPGA
- 2016
This paper presents an in-depth analysis of state-of-the-art CNN models, shows that convolutional layers are computation-centric while fully-connected layers are memory-centric, and proposes a CNN accelerator design on an embedded FPGA for ImageNet large-scale image classification.
14.5 Eyeriss: An energy-efficient reconfigurable accelerator for deep convolutional neural networks
- Computer Science, 2016 IEEE International Solid-State Circuits Conference (ISSCC)
- 2016
To achieve state-of-the-art accuracy, CNNs with not only a larger number of layers but also millions of filter weights and varying shapes are needed; this results in substantial data movement, which consumes significant energy.
DaDianNao: A Machine-Learning Supercomputer
- Computer Science, 2014 47th Annual IEEE/ACM International Symposium on Microarchitecture
- 2014
This article introduces a custom multi-chip machine-learning architecture, showing that, on a subset of the largest known neural network layers, it is possible to achieve a speedup of 450.65x over a GPU, and reduce the energy by 150.31x on average for a 64-chip system.
A 240 G-ops/s Mobile Coprocessor for Deep Neural Networks
- Computer Science, 2014 IEEE Conference on Computer Vision and Pattern Recognition Workshops
- 2014
The nn-X system is presented, a scalable, low-power coprocessor for enabling real-time execution of deep neural networks, able to achieve a peak performance of 227 G-ops/s, which translates to a performance per power improvement of 10 to 100 times that of conventional mobile and desktop processors.
Caffe: Convolutional Architecture for Fast Feature Embedding
- Computer Science, ACM Multimedia
- 2014
Caffe provides multimedia scientists and practitioners with a clean and modifiable framework for state-of-the-art deep learning algorithms and a collection of reference models for training and deploying general-purpose convolutional neural networks and other deep models efficiently on commodity architectures.
NeuFlow: A runtime reconfigurable dataflow processor for vision
- Computer Science, CVPR 2011 Workshops
- 2011
In this paper we present a scalable dataflow hardware architecture optimized for the computation of general-purpose vision algorithms — neuFlow — and a dataflow compiler — luaFlow — that transforms…
Hardware accelerated convolutional neural networks for synthetic vision systems
- Computer Science, Proceedings of 2010 IEEE International Symposium on Circuits and Systems
- 2010
This system is fully digital and is a modular vision engine with the goal of performing real-time detection, recognition and segmentation of mega-pixel images.
Deep Learning with Limited Numerical Precision
- Computer Science, ICML
- 2015
The results show that deep networks can be trained using only 16-bit wide fixed-point number representation when using stochastic rounding, and incur little to no degradation in the classification accuracy.