Corpus ID: 30456854

Design and Analysis of a Hardware CNN Accelerator

Kevin Kiningham
In recent years, Convolutional Neural Networks (CNNs) have revolutionized computer vision tasks. However, inference in current CNN designs is extremely computationally intensive. This has led to an explosion of new accelerator architectures designed to reduce power consumption and latency [20]. In this paper, we design and implement a systolic-array-based architecture we call ConvAU to efficiently accelerate dense matrix multiplication operations in CNNs. We also train an 8-bit quantized…
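As a rough illustration of how a systolic array can carry out dense matrix multiplication, the sketch below simulates a generic output-stationary array cycle by cycle. It is not ConvAU's actual design: the abstract does not state the dataflow, array size, or timing, so the function name `systolic_matmul`, the skewed schedule `k = t - i - j`, and the cycle count are all assumptions made for illustration.

```python
def systolic_matmul(a, b):
    """Simulate a hypothetical output-stationary systolic array computing a @ b.

    Assumption: processing element (i, j) holds the accumulator for output
    element (i, j). Rows of `a` enter from the left and columns of `b` enter
    from the top, each delayed by one cycle per row/column, so at cycle t
    PE (i, j) sees a[i][k] and b[k][j] with k = t - i - j.
    """
    m, k_dim, n = len(a), len(b), len(b[0])
    assert all(len(row) == k_dim for row in a), "inner dimensions must match"

    acc = [[0] * n for _ in range(m)]      # one accumulator per PE
    cycles = m + n + k_dim - 2             # cycle on which the last PE finishes

    for t in range(cycles):
        for i in range(m):
            for j in range(n):
                k = t - i - j              # operand index reaching PE (i, j) now
                if 0 <= k < k_dim:
                    acc[i][j] += a[i][k] * b[k][j]   # one multiply-accumulate
    return acc, cycles


if __name__ == "__main__":
    # Small integer operands, as an 8-bit quantized layer might supply.
    result, cycles = systolic_matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]])
    print(result, cycles)
```

In this model every processing element performs at most one multiply-accumulate per cycle and only exchanges operands with its neighbours, which is the property that makes systolic arrays attractive for dense matrix products.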

Adaptive Precision CNN Accelerator Using Radix-X Parallel Connected Memristor Crossbars

This paper develops an adaptive precision method by varying the number of memristors at each crosspoint, and presents a weight mapping algorithm designed for implementation on the authors' crossbar array, described as the radix-X Convolutional Neural Network Crossbar Array.

Towards Hardware Trojan Resilient Design of Convolutional Neural Networks

This paper investigates a new Hardware Trojan attack that targets the pooling layer of CNN implementations and shows that it reduces CNN accuracy by up to 30%.

Flexible Modularized Artificial Neural Network Implementation on FPGA

  • Kiruki Cosmas, K. Asami
  • Computer Science
    2018 5th International Conference on Soft Computing & Machine Intelligence (ISCMI)
  • 2018
This work shows that a well-modularized network is easily adapted to different applications, which helps take advantage of the reconfigurability of FPGAs.

Performance Implications of Big Data in Scalable Deep Learning: On the Importance of Bandwidth and Caching

It is found that storage and networking bandwidths are the main parameters determining Deep Learning training performance, and local data caching is an intriguing option that is overlooked in current state-of-the-art systems.

PREMA: A Predictive Multi-Task Scheduling Algorithm For Preemptible Neural Processing Units

  • Yujeong Choi, Minsoo Rhu
  • Computer Science
    2020 IEEE International Symposium on High Performance Computer Architecture (HPCA)
  • 2020
A case is made for a "preemptible" neural processing unit (NPU) and a "predictive" multi-task scheduler to meet the latency demands of high-priority inference while maintaining high throughput.

DCNN for Tactile Sensory Data Classification based on Transfer Learning

This framework demonstrates touch modality classification using pre-trained convolutional neural networks (CNNs), addressing the challenging task of recognizing the object touched by the E-Skin.



YodaNN: An Architecture for Ultralow Power Binary-Weight CNN Acceleration

This paper presents an accelerator optimized for binary-weight CNNs that significantly outperforms the state-of-the-art in terms of energy and area efficiency and removes the need for expensive multiplications, as well as reducing I/O bandwidth and storage.

Scalable and modularized RTL compilation of Convolutional Neural Networks onto FPGA

This work quantitatively analyzes the compiler's design strategy to optimize the throughput of a given CNN model under FPGA resource constraints, and demonstrates the promise of the automatic compiler solution for modularized and scalable hardware acceleration of deep learning.

ISAAC: A Convolutional Neural Network Accelerator with In-Situ Analog Arithmetic in Crossbars

This work explores an in-situ processing approach, where memristor crossbar arrays not only store input weights, but are also used to perform dot-product operations in an analog manner.

Quantized Convolutional Neural Networks for Mobile Devices

This paper proposes an efficient framework, namely Quantized CNN, to simultaneously speed-up the computation and reduce the storage and memory overhead of CNN models.

DianNao: a small-footprint high-throughput accelerator for ubiquitous machine-learning

This study designs an accelerator for large-scale CNNs and DNNs, with a special emphasis on the impact of memory on accelerator design, performance and energy, and shows that it is possible to design an accelerator with a high throughput, capable of performing 452 GOP/s in a small footprint.

EIE: Efficient Inference Engine on Compressed Deep Neural Network

  • Song Han, Xingyu Liu, …, W. Dally
  • Computer Science
    2016 ACM/IEEE 43rd Annual International Symposium on Computer Architecture (ISCA)
  • 2016
This paper presents an energy efficient inference engine (EIE) that performs inference on a compressed network model and accelerates the resulting sparse matrix-vector multiplication with weight sharing; it is 189x and 13x faster than CPU and GPU implementations of the same DNN without compression.

Improving the speed of neural networks on CPUs

This paper uses speech recognition as an example task, and shows that a real-time hybrid hidden Markov model / neural network (HMM/NN) large vocabulary system can be built with a 10× speedup over an unoptimized baseline and a 4× speedup over an aggressively optimized floating-point baseline at no cost in accuracy.

Deep Convolutional Neural Network Inference with Floating-point Weights and Fixed-point Activations

It is shown that using floating-point numbers for weights is more efficient than fixed-point representation for the same bit-width and enables compact hardware multiply-and-accumulate (MAC) unit design.

Training deep neural networks with low precision multiplications

It is found that very low precision is sufficient not just for running trained networks but also for training them, and that it is possible to train Maxout networks with 10-bit multiplications.

Fixed Point Quantization of Deep Convolutional Networks

This paper proposes a quantizer design for fixed point implementation of DCNs, formulates and solves an optimization problem to identify the optimal fixed point bit-width allocation across DCN layers, and demonstrates that fine-tuning can further enhance the accuracy of fixed point DCNs beyond that of the original floating point model.