Scalable and modularized RTL compilation of Convolutional Neural Networks onto FPGA

  • Yufei Ma, Naveen Suda, Yu Cao, Jae-sun Seo, Sarma B. K. Vrudhula
  • Published 1 August 2016
  • Computer Science
  • 2016 26th International Conference on Field Programmable Logic and Applications (FPL)
Despite their popularity, Convolutional Neural Networks (CNNs) remain challenging to deploy on portable systems because of their large data volume, intensive computation, and frequent memory access. Although previous FPGA acceleration schemes generated by high-level synthesis tools (e.g., HLS, OpenCL) have allowed for fast design optimization, hardware inefficiency still exists when allocating FPGA resources to maximize parallelism and throughput. A direct hardware-level design (i.e., RTL) can improve…
ALAMO: FPGA acceleration of deep learning algorithms with a modularized RTL compiler
An automatic RTL compiler for high-throughput FPGA implementation of diverse deep convolutional neural networks
This work presents an RTL-level CNN compiler that automatically generates customized FPGA hardware for the inference tasks of various CNNs, in order to enable high-level fast prototyping of CNNs from software to FPGAs and still keep the benefits of low-level hardware optimization.
Toolflows for Mapping Convolutional Neural Networks on FPGAs
A survey of the existing CNN-to-FPGA toolflows is presented, comprising a comparative study of their key characteristics, which include the supported applications, architectural choices, design space exploration methods, and achieved performance.
Fast generation of high throughput customized deep learning accelerators on FPGAs
An automatic code generation tool that synthesizes high throughput accelerators for CNN inferencing targeting broad types of CNNs and FPGAs, and adopts an algorithm-architecture co-design methodology based on frequency domain convolution.
CNN2Gate: Toward Designing a General Framework for Implementation of Convolutional Neural Networks on FPGA
An integrated framework (CNN2Gate) is introduced that supports compilation of a CNN model for an FPGA target, performs design-space exploration using a reinforcement learning agent, and automatically fits the design on different FPGAs with limited logic resources.
Optimising Convolutional Neural Networks for Reconfigurable Acceleration
A CNN model transpiler framework, Plumber, that can directly transform high-level models to FPGA designs with a novel model-hardware co-optimisation module is presented.
VHDL auto-generation tool for optimized hardware acceleration of convolutional neural networks on FPGA (VGT)
A VHDL generation tool (VGT) through which VHDL code (the CNN architecture) can be generated on the fly for different CNN models (benchmarked and hand-tuned); the generated design is modular, highly parallel, reconfigurable, scalable, fully pipelined, and adaptive to different CNN models.
Optimizing Frequency Domain Implementation of CNNs on FPGAs
An algorithm-architecture co-design methodology based on the computational characteristics of CNN models and the features of underlying hardware to realize high performance designs to speed up various CNN models.
Acceleration of Deep Learning on FPGA
A scalable and parameterized end-to-end ConvNet design using Intel FPGA SDK for OpenCL is proposed, which is 24.3X and 1.7X more energy efficient respectively.


Throughput-Optimized OpenCL-based FPGA Accelerator for Large-Scale Convolutional Neural Networks
This work presents a systematic design space exploration methodology to maximize the throughput of an OpenCL-based FPGA accelerator for a given CNN model, considering the FPGA's resource constraints such as on-chip memory, registers, computational resources and external memory bandwidth.
Optimizing FPGA-based Accelerator Design for Deep Convolutional Neural Networks
This work implements a CNN accelerator on a VC707 FPGA board and compares it to previous approaches, achieving a peak performance of 61.62 GFLOPS at a 100 MHz working frequency, which significantly outperforms previous approaches.
Going Deeper with Embedded FPGA Platform for Convolutional Neural Network
This paper presents an in-depth analysis of state-of-the-art CNN models and shows that Convolutional layers are computational-centric and Fully-Connected layers are memory-centric, and proposes a CNN accelerator design on embedded FPGA for Image-Net large-scale image classification.
14.5 Eyeriss: An energy-efficient reconfigurable accelerator for deep convolutional neural networks
To achieve state-of-the-art accuracy, CNNs need not only a larger number of layers but also millions of filter weights and varying shapes, which results in substantial data movement that consumes significant energy.
DaDianNao: A Machine-Learning Supercomputer
  • Yunji Chen, Tao Luo, O. Temam
  • Computer Science
  • 2014 47th Annual IEEE/ACM International Symposium on Microarchitecture
  • 2014
This article introduces a custom multi-chip machine-learning architecture, showing that, on a subset of the largest known neural network layers, it is possible to achieve a speedup of 450.65x over a GPU, and reduce the energy by 150.31x on average for a 64-chip system.
A 240 G-ops/s Mobile Coprocessor for Deep Neural Networks
The nn-X system is presented, a scalable, low-power coprocessor for enabling real-time execution of deep neural networks, able to achieve a peak performance of 227 G-ops/s, which translates to a performance per power improvement of 10 to 100 times that of conventional mobile and desktop processors.
Caffe: Convolutional Architecture for Fast Feature Embedding
Caffe provides multimedia scientists and practitioners with a clean and modifiable framework for state-of-the-art deep learning algorithms and a collection of reference models for training and deploying general-purpose convolutional neural networks and other deep models efficiently on commodity architectures.
NeuFlow: A runtime reconfigurable dataflow processor for vision
In this paper we present a scalable dataflow hardware architecture optimized for the computation of general-purpose vision algorithms — neuFlow — and a dataflow compiler — luaFlow — that transforms
Hardware accelerated convolutional neural networks for synthetic vision systems
This system is fully digital and is a modular vision engine with the goal of performing real-time detection, recognition and segmentation of mega-pixel images.
Deep Learning with Limited Numerical Precision
The results show that deep networks can be trained using only 16-bit wide fixed-point number representation when using stochastic rounding, and incur little to no degradation in the classification accuracy.