# Design and Analysis of a Hardware CNN Accelerator

```bibtex
@inproceedings{Kiningham2017DesignAA,
  title  = {Design and Analysis of a Hardware CNN Accelerator},
  author = {Kevin Kiningham},
  year   = {2017}
}
```

In recent years, Convolutional Neural Networks (CNNs) have revolutionized computer vision tasks. However, inference in current CNN designs is extremely computationally intensive. This has led to an explosion of new accelerator architectures designed to reduce power consumption and latency [20]. In this paper, we design and implement a systolic-array-based architecture, which we call ConvAU, to efficiently accelerate dense matrix multiplication operations in CNNs. We also train an 8-bit quantized…
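The abstract describes a systolic-array design for dense matrix multiplication over 8-bit quantized weights. As a rough illustration only (the paper's actual quantizer and dataflow are not specified in this excerpt), the sketch below quantizes float matrices to int8 with a symmetric linear scheme and models the array's step-by-step multiply-accumulate with int32 accumulation:

```python
import numpy as np

def quantize_int8(x):
    """Symmetric linear quantization to int8 (one common scheme;
    assumed here, not taken from the paper)."""
    scale = np.max(np.abs(x)) / 127.0
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

def systolic_matmul(a_q, b_q):
    """Reference model of a systolic array: on each step, every PE does one
    multiply-accumulate; the int32 accumulator mirrors the wide accumulators
    typical of int8 MAC hardware."""
    m, k = a_q.shape
    _, n = b_q.shape
    acc = np.zeros((m, n), dtype=np.int32)
    for t in range(k):  # one wavefront per step of the shared dimension
        acc += a_q[:, t:t + 1].astype(np.int32) @ b_q[t:t + 1, :].astype(np.int32)
    return acc

rng = np.random.default_rng(0)
a, b = rng.standard_normal((4, 8)), rng.standard_normal((8, 3))
a_q, sa = quantize_int8(a)
b_q, sb = quantize_int8(b)
approx = systolic_matmul(a_q, b_q) * (sa * sb)  # dequantize the int32 result
print(np.max(np.abs(approx - a @ b)))  # quantization error, small vs. a @ b
```

The key design point this models is that all per-cycle arithmetic inside the array is cheap integer multiply-accumulate; the float scales are applied once, outside the array, when the result is dequantized.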

#### 6 Citations

Adaptive Precision CNN Accelerator Using Radix-X Parallel Connected Memristor Crossbars

- Computer Science, Engineering
- ArXiv
- 2019

This paper develops an adaptive precision method by varying the number of memristors at each crosspoint, and presents a weight mapping algorithm designed for implementation on the proposed crossbar array, described as the radix-X Convolutional Neural Network Crossbar Array.

Flexible Modularized Artificial Neural Network Implementation on FPGA

- Computer Science
- 2018 5th International Conference on Soft Computing & Machine Intelligence (ISCMI)
- 2018

This work shows that a well-modularized network is easily adaptable to different applications, helping take advantage of the reconfigurability of FPGAs.

Performance Implications of Big Data in Scalable Deep Learning: On the Importance of Bandwidth and Caching

- Computer Science
- 2018 IEEE International Conference on Big Data (Big Data)
- 2018

It is found that storage and networking bandwidths are the main parameters determining Deep Learning training performance, and local data caching is an intriguing option that is overlooked in current state-of-the-art systems.

PREMA: A Predictive Multi-Task Scheduling Algorithm For Preemptible Neural Processing Units

- Computer Science
- 2020 IEEE International Symposium on High Performance Computer Architecture (HPCA)
- 2020

A case is made for a "preemptible" neural processing unit (NPU) and a "predictive" multi-task scheduler to meet the latency demands of high-priority inference while maintaining high throughput.

DCNN for Tactile Sensory Data Classification based on Transfer Learning

- Computer Science
- 2019 15th Conference on Ph.D Research in Microelectronics and Electronics (PRIME)
- 2019

This framework demonstrates a method for touch modality classification using pre-trained convolutional neural networks (CNNs), addressing the challenging task of recognizing the object touched by the E-Skin.

#### References

Showing 1–10 of 28 references.

YodaNN: An Architecture for Ultralow Power Binary-Weight CNN Acceleration

- Computer Science
- IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems
- 2018

This paper presents an accelerator optimized for binary-weight CNNs that significantly outperforms the state-of-the-art in terms of energy and area efficiency and removes the need for expensive multiplications, as well as reducing I/O bandwidth and storage.

Scalable and modularized RTL compilation of Convolutional Neural Networks onto FPGA

- Computer Science
- 2016 26th International Conference on Field Programmable Logic and Applications (FPL)
- 2016

This work quantitatively analyzes the compiler's design strategy for optimizing the throughput of a given CNN model under FPGA resource constraints, and demonstrates the promise of an automatic compiler solution for modularized and scalable hardware acceleration of deep learning.

ISAAC: A Convolutional Neural Network Accelerator with In-Situ Analog Arithmetic in Crossbars

- Computer Science
- 2016 ACM/IEEE 43rd Annual International Symposium on Computer Architecture (ISCA)
- 2016

This work explores an in-situ processing approach, where memristor crossbar arrays not only store input weights, but are also used to perform dot-product operations in an analog manner.

Quantized Convolutional Neural Networks for Mobile Devices

- Computer Science
- 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)
- 2016

This paper proposes an efficient framework, namely Quantized CNN, to simultaneously speed up the computation and reduce the storage and memory overhead of CNN models.

DianNao: a small-footprint high-throughput accelerator for ubiquitous machine-learning

- Computer Science
- ASPLOS 2014
- 2014

This study designs an accelerator for large-scale CNNs and DNNs, with a special emphasis on the impact of memory on accelerator design, performance and energy, and shows that it is possible to design an accelerator with a high throughput, capable of performing 452 GOP/s in a small footprint.

EIE: Efficient Inference Engine on Compressed Deep Neural Network

- Computer Science
- 2016 ACM/IEEE 43rd Annual International Symposium on Computer Architecture (ISCA)
- 2016

An energy-efficient inference engine (EIE) that performs inference on a compressed network model and accelerates the resulting sparse matrix-vector multiplication with weight sharing; it is 189x and 13x faster than CPU and GPU implementations, respectively, of the same DNN without compression.

Improving the speed of neural networks on CPUs

- Computer Science
- 2011

This paper uses speech recognition as an example task, and shows that a real-time hybrid hidden Markov model / neural network (HMM/NN) large-vocabulary system can be built with a 10× speedup over an unoptimized baseline and a 4× speedup over an aggressively optimized floating-point baseline at no cost in accuracy.

Deep Convolutional Neural Network Inference with Floating-point Weights and Fixed-point Activations

- Computer Science
- ArXiv
- 2017

It is shown that using floating-point numbers for weights is more efficient than fixed-point representation for the same bit-width and enables compact hardware multiply-and-accumulate (MAC) unit design.

Training deep neural networks with low precision multiplications

- Computer Science
- 2014

It is found that very low precision is sufficient not just for running trained networks but also for training them, and that it is possible to train Maxout networks with 10-bit multiplications.

Fixed Point Quantization of Deep Convolutional Networks

- Computer Science, Mathematics
- ICML
- 2016

This paper proposes a quantizer design for fixed-point implementation of DCNs, formulates and solves an optimization problem to identify the optimal fixed-point bit-width allocation across DCN layers, and demonstrates that fine-tuning can further enhance the accuracy of fixed-point DCNs beyond that of the original floating-point model.