# Design and Analysis of a Hardware CNN Accelerator

```bibtex
@inproceedings{Kiningham2017DesignAA,
  title  = {Design and Analysis of a Hardware CNN Accelerator},
  author = {Kevin Kiningham},
  year   = {2017}
}
```

In recent years, Convolutional Neural Networks (CNNs) have revolutionized computer vision tasks. However, inference in current CNN designs is extremely computationally intensive. This has led to an explosion of new accelerator architectures designed to reduce power consumption and latency [20]. In this paper, we design and implement a systolic array based architecture we call ConvAU to efficiently accelerate dense matrix multiplication operations in CNNs. We also train an 8-bit quantized…
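The abstract's two core ideas, dense matrix multiplication (the operation a systolic array accelerates) and 8-bit weight quantization, can be sketched as follows. This is an illustrative model only; the function names, quantization scheme, and tolerances here are assumptions, not ConvAU's actual design:

```python
import numpy as np

def quantize_int8(w):
    """Symmetric linear quantization of a float tensor to int8.
    Illustrative scheme; the paper's quantization may differ."""
    scale = np.abs(w).max() / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def int8_matmul(a_q, b_q):
    """Dense int8 matmul with int32 accumulation -- the arithmetic a
    systolic array performs (computed directly here; a real array
    streams tiles of operands through a grid of MAC units)."""
    return a_q.astype(np.int32) @ b_q.astype(np.int32)

rng = np.random.default_rng(0)
a, b = rng.normal(size=(4, 8)), rng.normal(size=(8, 3))
a_q, sa = quantize_int8(a)
b_q, sb = quantize_int8(b)
approx = int8_matmul(a_q, b_q) * (sa * sb)  # dequantize the int32 product
```

The dequantized product closely tracks the float result, which is why 8-bit inference can preserve accuracy while shrinking storage and multiplier cost.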

## 7 Citations

### Adaptive Precision CNN Accelerator Using Radix-X Parallel Connected Memristor Crossbars

- Computer Science
- ArXiv
- 2019

This paper develops an adaptive-precision method by varying the number of memristors at each crosspoint, and presents a weight-mapping algorithm designed for the proposed radix-X Convolutional Neural Network Crossbar Array.

### Towards Hardware Trojan Resilient Design of Convolutional Neural Networks

- Computer Science
- 2022 IEEE 35th International System-on-Chip Conference (SOCC)
- 2022

This paper investigates a new Hardware Trojan attack that targets the pooling layer of CNN implementations and shows that the accuracy of CNN is reduced by up to 30%.

### Flexible Modularized Artificial Neural Network Implementation on FPGA

- Computer Science
- 2018 5th International Conference on Soft Computing & Machine Intelligence (ISCMI)
- 2018

This work shows that a well-modularized network is easily adapted to different applications, helping to take advantage of the reconfigurability of FPGAs.

### Performance Implications of Big Data in Scalable Deep Learning: On the Importance of Bandwidth and Caching

- Computer Science
- 2018 IEEE International Conference on Big Data (Big Data)
- 2018

It is found that storage and networking bandwidths are the main parameters determining Deep Learning training performance, and local data caching is an intriguing option that is overlooked in current state-of-the-art systems.

### PREMA: A Predictive Multi-Task Scheduling Algorithm For Preemptible Neural Processing Units

- Computer Science
- 2020 IEEE International Symposium on High Performance Computer Architecture (HPCA)
- 2020

A case is made for a "preemptible" neural processing unit (NPU) and a "predictive" multi-task scheduler to meet the latency demands of high-priority inference while maintaining high throughput.

### DCNN for Tactile Sensory Data Classification based on Transfer Learning

- Computer Science
- 2019 15th Conference on Ph.D Research in Microelectronics and Electronics (PRIME)
- 2019

This framework demonstrates touch-modality classification using pre-trained convolutional neural networks (CNNs), addressing the challenging task of recognizing the object touched by the E-Skin.

## References

Showing 1–10 of 28 references.

### YodaNN: An Architecture for Ultralow Power Binary-Weight CNN Acceleration

- Computer Science
- IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems
- 2018

This paper presents an accelerator optimized for binary-weight CNNs that significantly outperforms the state-of-the-art in terms of energy and area efficiency and removes the need for expensive multiplications, as well as reducing I/O bandwidth and storage.

### Scalable and modularized RTL compilation of Convolutional Neural Networks onto FPGA

- Computer Science
- 2016 26th International Conference on Field Programmable Logic and Applications (FPL)
- 2016

This work quantitatively analyzes the compiler's design strategy to optimize the throughput of a given CNN model under FPGA resource constraints, and demonstrates the promise of an automatic compiler solution for modularized and scalable hardware acceleration of deep learning.

### ISAAC: A Convolutional Neural Network Accelerator with In-Situ Analog Arithmetic in Crossbars

- Computer Science
- 2016 ACM/IEEE 43rd Annual International Symposium on Computer Architecture (ISCA)
- 2016

This work explores an in-situ processing approach, where memristor crossbar arrays not only store input weights, but are also used to perform dot-product operations in an analog manner.
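The in-situ idea above can be modeled in a few lines: each crossbar column's output current is the sum of conductance × voltage over its rows, so the array computes a dot product by physics (Kirchhoff's current law) rather than by digital MACs. This is an idealized sketch, ignoring device noise and ADC quantization, and the names are illustrative:

```python
import numpy as np

def crossbar_dot(G, v):
    """Idealized memristor crossbar: weights stored as conductances G
    (rows x columns), inputs applied as row voltages v. Each column's
    output current is sum_r(G[r, c] * v[r]) -- a dot product in situ."""
    return v @ G

# Conductances encoding a 3x2 weight matrix, and three input voltages.
G = np.array([[0.1, 0.4],
              [0.3, 0.2],
              [0.5, 0.6]])
v = np.array([1.0, 0.5, 0.2])
currents = crossbar_dot(G, v)
```

Because the multiply and accumulate happen where the weights are stored, no weight movement is needed, which is the source of ISAAC's energy advantage.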

### Quantized Convolutional Neural Networks for Mobile Devices

- Computer Science
- 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)
- 2016

This paper proposes an efficient framework, namely Quantized CNN, to simultaneously speed-up the computation and reduce the storage and memory overhead of CNN models.

### DianNao: a small-footprint high-throughput accelerator for ubiquitous machine-learning

- Computer Science
- ASPLOS 2014
- 2014

This study designs an accelerator for large-scale CNNs and DNNs, with a special emphasis on the impact of memory on accelerator design, performance and energy, and shows that it is possible to design an accelerator with a high throughput, capable of performing 452 GOP/s in a small footprint.

### EIE: Efficient Inference Engine on Compressed Deep Neural Network

- Computer Science
- 2016 ACM/IEEE 43rd Annual International Symposium on Computer Architecture (ISCA)
- 2016

An energy-efficient inference engine (EIE) is presented that performs inference on a compressed network model, accelerating the resulting sparse matrix-vector multiplication with weight sharing; it is 189x and 13x faster than CPU and GPU implementations of the same DNN without compression.
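The combination EIE exploits, a pruned (sparse) matrix whose nonzeros are small indices into a shared codebook of weight values, can be sketched as a CSR-style sparse matrix-vector product. The storage layout and names here are illustrative assumptions, not EIE's actual encoding:

```python
import numpy as np

# Shared weight values: nonzeros store 2-bit codes instead of full floats.
codebook = np.array([0.0, -0.5, 0.25, 1.0])

# CSR-like layout of a 3x4 sparse matrix of codebook indices.
indptr = [0, 2, 3, 5]      # where each row's nonzeros start/end
cols   = [0, 3, 1, 0, 2]   # column of each nonzero
codes  = [1, 3, 2, 3, 1]   # codebook index of each nonzero

def spmv_shared(x):
    """Sparse matvec with weight sharing: only nonzero entries are
    visited, and each weight is a cheap codebook lookup."""
    y = np.zeros(len(indptr) - 1)
    for r in range(len(y)):
        for k in range(indptr[r], indptr[r + 1]):
            y[r] += codebook[codes[k]] * x[cols[k]]
    return y

x = np.array([1.0, 2.0, 3.0, 4.0])
y = spmv_shared(x)
```

Skipping zeros saves both the multiplications and, more importantly for energy, the memory traffic of fetching pruned weights.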

### Improving the speed of neural networks on CPUs

- Computer Science
- 2011

This paper uses speech recognition as an example task, and shows that a real-time hybrid hidden Markov model / neural network (HMM/NN) large-vocabulary system can be built with a 10× speedup over an unoptimized baseline and a 4× speedup over an aggressively optimized floating-point baseline at no cost in accuracy.

### Deep Convolutional Neural Network Inference with Floating-point Weights and Fixed-point Activations

- Computer Science
- ArXiv
- 2017

It is shown that using floating-point numbers for weights is more efficient than fixed-point representation for the same bit-width and enables compact hardware multiply-and-accumulate (MAC) unit design.

### Training deep neural networks with low precision multiplications

- Computer Science
- 2014

It is found that very low precision is sufficient not just for running trained networks but also for training them, and that it is possible to train Maxout networks with 10-bit multiplications.

### Fixed Point Quantization of Deep Convolutional Networks

- Computer Science
- ICML
- 2016

This paper proposes a quantizer design for fixed-point implementation of DCNs, formulates and solves an optimization problem to identify the optimal fixed-point bit-width allocation across DCN layers, and demonstrates that fine-tuning can further enhance the accuracy of fixed-point DCNs beyond that of the original floating-point model.
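The bit-width allocation question this paper optimizes can be made concrete with a minimal fixed-point rounding helper: a layer's value range determines how many bits must cover the integer part, and the remainder of the budget goes to the fraction. This sketch and its function name are illustrative assumptions, not the paper's quantizer:

```python
import numpy as np

def to_fixed_point(x, int_bits, frac_bits):
    """Round x onto a signed fixed-point grid with int_bits + frac_bits
    total bits (sign included in int_bits). Values outside the
    representable range saturate."""
    scale = 2.0 ** frac_bits
    lo = -(2 ** (int_bits + frac_bits - 1))
    hi = 2 ** (int_bits + frac_bits - 1) - 1
    return np.clip(np.round(x * scale), lo, hi) / scale

# A layer whose weights stay well inside (-1, 1) can spend nearly the
# whole 8-bit budget on fractional precision:
w = np.array([0.131, -0.377, 0.642])
w_q = to_fixed_point(w, int_bits=1, frac_bits=7)
```

Layers with larger dynamic range need more integer bits at the cost of precision, which is exactly the per-layer trade-off the paper's optimization problem allocates.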