Optimizing CNN-based Segmentation with Deeply Customized Convolutional and Deconvolutional Architectures on FPGA

@article{Liu2018OptimizingCS,
  title={Optimizing CNN-based Segmentation with Deeply Customized Convolutional and Deconvolutional Architectures on FPGA},
  author={Shuanglong Liu and Hongxiang Fan and Xinyu Niu and Ho-Cheung Ng and Yang Chu and Wayne Luk},
  journal={ACM Transactions on Reconfigurable Technology and Systems (TRETS)},
  year={2018},
  volume={11},
  pages={1--22}
}
Convolutional Neural Network (CNN)-based algorithms have been successful in solving image recognition problems, showing very large accuracy improvements. In recent years, deconvolution layers have been widely used as key components in state-of-the-art CNNs for end-to-end training and for models supporting tasks such as image segmentation and super-resolution. However, deconvolution algorithms are computationally intensive, which limits their applicability to real-time applications…
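
To make the cost of a deconvolution (transposed convolution) layer concrete, the following is a minimal pure-Python sketch. It is an illustration only, not the architecture proposed in the paper: the function name `deconv2d`, the single-channel input, the lack of padding, and the multiply-accumulate (MAC) counter are all assumptions made for this example.

```python
def deconv2d(x, w, stride):
    """Single-channel transposed convolution (illustrative sketch).

    x: input feature map as a list of rows (H x W)
    w: square kernel as a list of rows (K x K)
    stride: upsampling factor

    Each input pixel scatters a scaled copy of the kernel into the output,
    so with no padding the output is ((H-1)*stride + K) x ((W-1)*stride + K).
    Returns the output map and the number of multiply-accumulates performed.
    """
    h, wd = len(x), len(x[0])
    k = len(w)
    out_h = (h - 1) * stride + k
    out_w = (wd - 1) * stride + k
    y = [[0.0] * out_w for _ in range(out_h)]
    macs = 0
    for i in range(h):
        for j in range(wd):
            for a in range(k):
                for b in range(k):
                    # Overlapping kernel copies accumulate into the same output pixel.
                    y[i * stride + a][j * stride + b] += x[i][j] * w[a][b]
                    macs += 1
    return y, macs


if __name__ == "__main__":
    x = [[1, 2], [3, 4]]
    w = [[1, 1], [1, 1]]
    y, macs = deconv2d(x, w, stride=1)
    print(y, macs)
```

In this sketch the MAC count is H × W × K × K per input/output channel pair, and with a stride smaller than the kernel size neighbouring kernel copies overlap, so output pixels receive several accumulated contributions.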

A Unified Hardware Architecture for Convolutions and Deconvolutions in CNN

A scalable neural network hardware architecture for image segmentation is proposed that combines convolution and deconvolution operations by sharing the same computing resources, and access to on-chip and off-chip memories is optimized to alleviate the burden introduced by partial sum.

Optimizing Fully Spectral Convolutional Neural Networks on FPGA

  • Shuanglong Liu, W. Luk
  • Computer Science
    2020 International Conference on Field-Programmable Technology (ICFPT)
  • 2020
A fully spectral CNN approach is presented that introduces a novel spectral Rectified Linear Unit activation function, which avoids multiple compute-intensive domain transformations while maintaining the non-linearity in the network, and that takes hardware efficiency into account in the algorithm design.

Towards an Efficient Accelerator for DNN-Based Remote Sensing Image Segmentation on FPGAs

  • Shuanglong Liu, W. Luk
  • Computer Science
    2019 29th International Conference on Field Programmable Logic and Applications (FPL)
  • 2019
A uniform architecture to efficiently implement both convolution and deconvolution in one vector multiplication module is proposed and an optimized DNN model is developed for real-time RSI segmentation, which shows preferable accuracy compared to other methods.

Toward Full-Stack Acceleration of Deep Convolutional Neural Networks on FPGAs.

This article introduces a highly customized streaming hardware architecture that focuses on improving the compute efficiency for streaming applications by providing full-stack acceleration of CNNs on FPGAs, and demonstrates high performance that exceeds that of state-of-the-art FPGA accelerators.

A Mixed-Pruning Based Framework for Embedded Convolutional Neural Network Acceleration

A framework containing model compression and hardware acceleration is proposed to solve the performance bottlenecks of CNN implementation, and an accelerator for mapping CNNs onto a field-programmable gate array (FPGA) makes CNN implementation flexible, configurable and efficient.

Efficient Deconvolution Architecture for Heterogeneous Systems-on-Chip

This paper presents a novel purpose-designed hardware accelerator to perform 2D deconvolutions that reaches a computational capability up to 20% higher, and saves more than 60% and 80% of power consumption and logic resources requirement, respectively, using 5.7× fewer on-chip memory resources.

Memory-Efficient Architecture for Accelerating Generative Networks on FPGA

A novel parametrized deconvolutional architecture based on an FPGA-friendly method is proposed to accelerate the generator of GANs, by storing all intermediate data in on-chip memories and significantly reducing off-chip data transfers.

A Real-Time Object Detection Accelerator with Compressed SSDLite on FPGA

This paper proposes a novel FPGA-based architecture for SSDLiteM2 in combination with hardware optimizations including fused BRB, processing element (PE) sharing and load-balanced channel pruning, and a novel quantization scheme called partial quantization has been developed.

Optimizing CNN-Based Hyperspectral Image Classification on FPGAs

A novel CNN-based algorithm for HSI classification which takes into account hardware efficiency is proposed and achieves comparable processing speed but provides a much higher classification accuracy.

An Intermediate-Centric Dataflow for Transposed Convolution Acceleration on FPGA

An intermediate-centric dataflow scheme is proposed, in which the generation of the intermediate patch is decoupled from its further processing, aiming to efficiently perform the backward-stencil computation, which constrains the parallel computing of transposed convolution.

References

Going Deeper with Embedded FPGA Platform for Convolutional Neural Network

This paper presents an in-depth analysis of state-of-the-art CNN models and shows that Convolutional layers are computational-centric and Fully-Connected layers are memory-centric, and proposes a CNN accelerator design on embedded FPGA for Image-Net large-scale image classification.

Optimizing FPGA-based Accelerator Design for Deep Convolutional Neural Networks

This work implements a CNN accelerator on a VC707 FPGA board and compares it to previous approaches, achieving a peak performance of 61.62 GFLOPS at a 100 MHz working frequency, which significantly outperforms previous approaches.

Optimizing CNN-Based Object Detection Algorithms on Embedded FPGA Platforms

A novel method is proposed to optimise CNN-based object detection algorithms targeting embedded FPGA platforms; it takes network architectures and resource constraints as input and tunes hardware parameters with algorithm-specific information to explore the design space and achieve high performance.

Evaluating Fast Algorithms for Convolutional Neural Networks on FPGAs

This paper proposes a novel architecture for implementing Winograd algorithm on FPGAs and proposes an analytical model to predict the resource usage and reason about the performance, and uses the model to guide a fast design space exploration.

Real-Time Single Image and Video Super-Resolution Using an Efficient Sub-Pixel Convolutional Neural Network

This paper presents the first convolutional neural network capable of real-time SR of 1080p videos on a single K2 GPU and introduces an efficient sub-pixel convolution layer which learns an array of upscaling filters to upscale the final LR feature maps into the HR output.

Convolutional neural networks at constrained time cost

  • Kaiming He, Jian Sun
  • Computer Science
    2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)
  • 2015
This paper investigates the accuracy of CNNs under constrained time cost, and presents an architecture that achieves very competitive accuracy in the ImageNet dataset, yet is 20% faster than “AlexNet” [14] (16.0% top-5 error, 10-view test).

Caffe: Convolutional Architecture for Fast Feature Embedding

Caffe provides multimedia scientists and practitioners with a clean and modifiable framework for state-of-the-art deep learning algorithms and a collection of reference models for training and deploying general-purpose convolutional neural networks and other deep models efficiently on commodity architectures.

SegNet: A Deep Convolutional Encoder-Decoder Architecture for Image Segmentation

Quantitative assessments show that SegNet provides good performance with competitive inference time and most efficient inference memory-wise as compared to other architectures, including FCN and DeconvNet.

SCNN: An accelerator for compressed-sparse convolutional neural networks

  • A. Parashar, Minsoo Rhu, …, W. Dally
  • Computer Science
    2017 ACM/IEEE 44th Annual International Symposium on Computer Architecture (ISCA)
  • 2017
The Sparse CNN (SCNN) accelerator architecture is introduced, which improves performance and energy efficiency by exploiting the zero-valued weights that stem from network pruning during training and zero-valued activations that arise from the common ReLU operator.

U-Net: Convolutional Networks for Biomedical Image Segmentation

It is shown that such a network can be trained end-to-end from very few images and outperforms the prior best method (a sliding-window convolutional network) on the ISBI challenge for segmentation of neuronal structures in electron microscopic stacks.