Fast convolutional neural networks on FPGAs with hls4ml

@article{Aarrestad2021FastCN,
  title={Fast convolutional neural networks on FPGAs with hls4ml},
  author={Thea Klaeboe Aarrestad and Vladimir Loncar and Nicol{\`o} Ghielmetti and Maurizio Pierini and Sioni Summers and Jennifer Ngadiuba and Christoffer Petersson and Hampus Linander and Yutaro Iiyama and Giuseppe Di Guglielmo and Javier Mauricio Duarte and Philip C. Harris and Dylan S. Rankin and Sergo Jindariani and Kevin Pedro and Nhan Viet Tran and Miaoyuan Liu and Edward Kreinar and Zhenbin Wu and Duc Hoang},
  journal={Machine Learning: Science and Technology},
  year={2021},
  volume={2}
}
We introduce an automated tool for deploying ultra low-latency, low-power deep neural networks with convolutional layers on field-programmable gate arrays (FPGAs). By extending the hls4ml library, we demonstrate an inference latency of 5 µs using convolutional architectures, targeting microsecond latency applications like those at the CERN Large Hadron Collider. Considering benchmark models trained on the Street View House Numbers Dataset, we demonstrate various methods for model compression in… 

De-specializing an HLS library for Deep Neural Networks: improvements upon hls4ml

Custom hardware accelerators for Deep Neural Networks are increasingly popular: in fact, the flexibility and performance offered by FPGAs are well-suited to the computational effort and low latency

hls4ml: An Open-Source Codesign Workflow to Empower Scientific Low-Power Machine Learning Devices

hls4ml, an open-source software-hardware codesign workflow that interprets and translates machine learning algorithms for implementation in FPGAs and ASICs, is developed specifically to support domain scientists.

Nanosecond machine learning event classification with boosted decision trees in FPGA for high energy physics

This work presents a novel implementation of classification using the machine learning/artificial intelligence method called boosted decision trees (BDT) on field programmable gate arrays (FPGA) and aims to provide decisions at the lowest latency values for real-time event classification.

FPGA-Based Acceleration of Convolutional Neural Network for Gesture Recognition Using mm-Wave FMCW Radar

This study uses CNNs to classify hand gestures obtained using mmWave Frequency-Modulated Continuous Wave radar, and discusses the complete workflow from radar data preprocessing and model compression (quantization and pruning) to the final implementation of the CNN accelerator on FPGAs.

Graph Neural Networks for Charged Particle Tracking on FPGAs

An automated translation workflow, integrated into a broader tool called hls4ml, is introduced for converting GNNs into firmware for field-programmable gate arrays (FPGAs), which could enable the inclusion of charged particle tracking GNNs at the trigger level for HL-LHC experiments.

Tailor: Altering Skip Connections for Resource-Efficient Inference

It is argued that while a network’s skip connections are needed for the network to learn, they can later be removed or shortened to provide a more hardware-efficient implementation with minimal to no accuracy loss.

FastStamp

AoCStream: All-on-Chip CNN Accelerator With Stream-Based Line-Buffer Architecture

Compared to previous accelerators with similar object detection accuracy, the proposed accelerator reaches much higher throughput while using fewer FPGA resources (LUTs, registers, and DSPs), and is therefore much more efficient.

Application Specific Instruction-Set Processors for Machine Learning Applications

A RISC-V-based ASIP for machine learning applications is developed, and three main design-space optimizations of ASIPs are explored: a specialized application-specific ISA, vector processing, and a multi-core architecture (for task-level parallelism).

FPGA implementation of a Convolutional Neural Network for image classification

The proposed technique for implementing a Convolutional Neural Network is ready for a hardware FPGA implementation and it can be very useful for real-time embedded applications.

References


Compressing deep neural networks on FPGAs to binary and ternary precision with hls4ml

We present the implementation of binary and ternary neural networks in the hls4ml library, designed to automatically convert deep neural network models to digital circuits with field-programmable

FINN: A Framework for Fast, Scalable Binarized Neural Network Inference

FINN, a framework for building fast and flexible FPGA accelerators, is presented; it uses a flexible heterogeneous streaming architecture that implements fully connected, convolutional, and pooling layers, with per-layer compute resources tailored to user-provided throughput requirements.

FP-DNN: An Automated Framework for Mapping Deep Neural Networks onto FPGAs with RTL-HLS Hybrid Templates

  • Yijin Guan, Hao Liang, J. Cong
  • Computer Science
    2017 IEEE 25th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM)
  • 2017
FP-DNN (Field Programmable DNN), an end-to-end framework that takes TensorFlow-described DNNs as input, and automatically generates the hardware implementations on FPGA boards with RTL-HLS hybrid templates, is proposed.

fpgaConvNet: A Framework for Mapping Convolutional Neural Networks on FPGAs

Convolutional Neural Networks (ConvNets) are a powerful Deep Learning model, providing state-of-the-art accuracy to many emerging classification problems. However, ConvNet classification is a

Efficient FPGA acceleration of Convolutional Neural Networks using logical-3D compute array

This paper presents a flexible yet highly efficient 3D neuron array architecture that is a natural fit for convolutional layers, along with a technique for optimizing its parameters, including on-chip buffer sizes, under a given set of resource constraints on modern FPGAs.

Toolflows for Mapping Convolutional Neural Networks on FPGAs

A survey of the existing CNN-to-FPGA toolflows is presented, comprising a comparative study of their key characteristics, which include the supported applications, architectural choices, design space exploration methods, and achieved performance.

From high-level deep neural models to FPGAs

DnnWeaver is devised, a framework that automatically generates a synthesizable accelerator for a given (DNN, FPGA) pair from a high-level specification in Caffe, one that best matches the needs of the DNN while providing high performance and efficiency gains for the target FPGA.

fpgaConvNet: Automated Mapping of Convolutional Neural Networks on FPGAs (Abstract Only)

In recent years, Convolutional Neural Networks (ConvNets) have become the state-of-the-art in several Artificial Intelligence tasks. Across the range of applications, the performance needs vary

Latency-driven design for FPGA-based convolutional neural networks

A latency-driven design methodology for mapping ConvNets on FPGAs is presented; it employs novel transformations over a Synchronous Dataflow-based modelling framework together with a latency-centric optimisation procedure in order to efficiently explore the design space targeting low-latency designs.

Caffeinated FPGAs: FPGA framework For Convolutional Neural Networks

This work presents a modified version of the popular CNN framework Caffe, with FPGA support, which allows for classification using CNN models and specialized FPGA implementations with the flexibility of reprogramming the device when necessary, seamless memory transactions between host and device, simple-to-use test benches, and the ability to create pipelined layer implementations.
...