Heterogeneous Dual-Core Overlay Processor for Light-Weight CNNs

  • Tiandong Zhao, Yunxuan Yu, Kun Wang, Lei He
  • Published 1 May 2021
  • Computer Science
  • 2021 IEEE 29th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM)
Convolutional neural networks (CNNs) have achieved broad success in artificial intelligence applications such as image classification and object detection. A plethora of models has emerged with different operators and architectures, gradually shifting attention from accuracy to efficiency in terms of speed and power, because early VGG-like architectures carry significant redundancy. Light-weight CNNs have been proposed to reduce computational complexity and parameter count…
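
To make the claim about reduced computation and parameters concrete, the following is a minimal sketch that compares a standard convolution with a depthwise separable convolution, the operator popularized by MobileNets. The abstract is truncated and does not say which operators the paper targets, so the choice of operator, the function names, and the layer shape below are illustrative assumptions rather than details taken from the paper.

```python
def conv_cost(k, c_in, c_out, h_out, w_out):
    """Parameters and multiply-accumulates (MACs) of a standard k x k convolution.

    Bias terms are ignored. Each output pixel of each output channel needs
    k * k * c_in multiplications.
    """
    params = k * k * c_in * c_out
    macs = params * h_out * w_out
    return params, macs


def depthwise_separable_cost(k, c_in, c_out, h_out, w_out):
    """Parameters and MACs of a k x k depthwise conv followed by a 1 x 1 pointwise conv.

    Depthwise part: one k x k filter per input channel.
    Pointwise part: a 1 x 1 convolution mixing c_in channels into c_out channels.
    """
    params = k * k * c_in + c_in * c_out
    macs = params * h_out * w_out
    return params, macs


if __name__ == "__main__":
    # Hypothetical layer shape, chosen only for illustration.
    k, c_in, c_out, h, w = 3, 128, 128, 56, 56

    std_params, std_macs = conv_cost(k, c_in, c_out, h, w)
    dws_params, dws_macs = depthwise_separable_cost(k, c_in, c_out, h, w)

    print(f"standard:            {std_params} params, {std_macs} MACs")
    print(f"depthwise separable: {dws_params} params, {dws_macs} MACs")
    print(f"reduction factor:    {std_params / dws_params:.2f}x")
```

For both parameters and MACs, the ratio of depthwise separable cost to standard cost is 1/c_out + 1/k², so with 3 x 3 kernels the saving approaches 9x as the number of output channels grows.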


Redundancy-Reduced MobileNet Acceleration on Reconfigurable Logic for ImageNet Classification
This work highlights two levels of model redundancy that are widespread in modern CNNs and proposes an efficient system design for a Redundancy-Reduced MobileNet (RR-MobileNet) in which off-chip memory traffic is used only to transfer inputs and outputs, while parameters and intermediate values are kept in on-chip BRAM blocks.
A High-Performance CNN Processor Based on FPGA for MobileNets
  • Di Wu, Yu Zhang, Yi Shan
  • Computer Science
  • 2019 29th International Conference on Field Programmable Logic and Applications (FPL)
  • 2019
A high-performance CNN processor based on FPGA is proposed in this paper and a special architecture called Channel Augmentation is designed to improve the efficiency in the first layer of MobileNets.
A CNN Accelerator on FPGA Using Depthwise Separable Convolution
This work presents a scalable, high-performance CNN accelerator optimized for depthwise separable convolution that can be fit into FPGAs of different sizes and achieves a 20x speedup compared to a CPU.
Light-OPU: An FPGA-based Overlay Processor for Lightweight Convolutional Neural Networks
This paper proposes Light-OPU, an FPGA-based overlay processor with a corresponding compilation flow for general LW-CNN acceleration, which is evaluated using all major LW-CNNs including the newly released MobileNetV3.
MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications
This work introduces two simple global hyper-parameters that efficiently trade off between latency and accuracy and demonstrates the effectiveness of MobileNets across a wide range of applications and use cases, including object detection, fine-grain classification, face attributes, and large-scale geo-localization.
TGPA: Tile-Grained Pipeline Architecture for Low Latency CNN Inference
This work proposes the Tile-Grained Pipeline Architecture (TGPA), a heterogeneous design that supports pipelined execution of multiple tiles within a single input image on multiple heterogeneous accelerators.
A high performance FPGA-based accelerator for large-scale convolutional neural networks
This work proposes an end-to-end FPGA-based CNN accelerator with all the layers mapped on one chip so that different layers can work concurrently in a pipelined structure to increase the throughput.
Maximizing CNN accelerator efficiency through resource partitioning
This work presents a new CNN accelerator paradigm and an accompanying automated design methodology that partitions the available FPGA resources into multiple processors, each of which is tailored for a different subset of the CNN convolutional layers.
OPU: An FPGA-Based Overlay Processor for Convolutional Neural Networks
This work proposes OPU, a domain-specific FPGA overlay processor for accelerating CNNs. It offers software-like programmability to CNN end users: CNN algorithms are automatically compiled into executable code, which OPU loads and executes without reconfiguring the FPGA when the network is switched or updated.
SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <1MB model size
This work proposes a small DNN architecture called SqueezeNet, which achieves AlexNet-level accuracy on ImageNet with 50x fewer parameters and is able to compress to less than 0.5MB (510x smaller than AlexNet).