Corpus ID: 30456854

Design and Analysis of a Hardware CNN Accelerator

@inproceedings{Kiningham2017DesignAA,
  title={Design and Analysis of a Hardware CNN Accelerator},
  author={Kevin Kiningham},
  year={2017}
}
In recent years, Convolutional Neural Networks (CNNs) have revolutionized computer vision tasks. However, inference in current CNN designs is extremely computationally intensive. This has led to an explosion of new accelerator architectures designed to reduce power consumption and latency [20]. In this paper, we design and implement a systolic array based architecture we call ConvAU to efficiently accelerate dense matrix multiplication operations in CNNs. We also train an 8-bit quantized…
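The abstract's two key ideas, a systolic array for dense matrix multiplication and 8-bit quantized inference, can be illustrated with a small cycle-level model. The sketch below is under my own assumptions (output-stationary dataflow, int8 operands with int32 accumulators), not the paper's actual ConvAU design:

```python
import numpy as np

def systolic_matmul_int8(A, B):
    """Cycle-level model of an output-stationary systolic array: PE (i, j)
    holds accumulator C[i, j]; skewed operand feeding means the partial
    product A[i, k] * B[k, j] arrives at cycle i + j + k."""
    A = A.astype(np.int8)
    B = B.astype(np.int8)
    n, depth = A.shape
    _, m = B.shape
    acc = np.zeros((n, m), dtype=np.int32)   # 8-bit operands, 32-bit accumulators
    for cycle in range(n + m + depth - 2):   # one diagonal wavefront per cycle
        for i in range(n):
            for j in range(m):
                k = cycle - i - j
                if 0 <= k < depth:
                    acc[i, j] += np.int32(A[i, k]) * np.int32(B[k, j])
    return acc

A = np.random.randint(-128, 128, (4, 6))
B = np.random.randint(-128, 128, (6, 5))
assert np.array_equal(systolic_matmul_int8(A, B), A @ B)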
Citations

Adaptive Precision CNN Accelerator Using Radix-X Parallel Connected Memristor Crossbars
TLDR: This paper develops an adaptive precision method by varying the number of memristors at each crosspoint, and presents a weight mapping algorithm designed for the authors' crossbar array, described as the radix-X Convolutional Neural Network Crossbar Array.
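As a rough illustration of the radix-X idea as summarized above (precision scaled by the number of devices per crosspoint), the sketch below decomposes an integer weight into per-device digits weighted by powers of a radix. All names and parameter values are mine, not the paper's:

```python
def map_weight_radix_x(w, radix=4, n_devices=3):
    """Decompose a non-negative integer weight into n_devices conductance
    codes; device d contributes code_d * radix**d, so precision grows
    with the number of parallel memristors per crosspoint."""
    codes = []
    for _ in range(n_devices):
        codes.append(w % radix)   # one low-precision device per digit
        w //= radix
    assert w == 0, "weight exceeds representable range"
    return codes

def reconstruct(codes, radix=4):
    return sum(c * radix ** d for d, c in enumerate(codes))

codes = map_weight_radix_x(27)   # 27 = 3*1 + 2*4 + 1*16
assert reconstruct(codes) == 27
```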
Flexible Modularized Artificial Neural Network Implementation on FPGA
  • Kiruki Cosmas, K. Asami
  • Computer Science
  • 2018 5th International Conference on Soft Computing & Machine Intelligence (ISCMI)
  • 2018
TLDR: This work shows that a well-modularized network is easily adaptable to different applications, helping take advantage of the re-configurability of FPGAs.
Performance Implications of Big Data in Scalable Deep Learning: On the Importance of Bandwidth and Caching
TLDR: It is found that storage and networking bandwidths are the main parameters determining deep learning training performance, and that local data caching is an intriguing option overlooked in current state-of-the-art systems.
PREMA: A Predictive Multi-Task Scheduling Algorithm For Preemptible Neural Processing Units
  • Yujeong Choi, Minsoo Rhu
  • Computer Science
  • 2020 IEEE International Symposium on High Performance Computer Architecture (HPCA)
  • 2020
TLDR: A case is made for a "preemptible" neural processing unit (NPU) and a "predictive" multi-task scheduler to meet the latency demands of high-priority inference while maintaining high throughput.
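A toy model of preemptive priority scheduling, in the spirit of the summary above, is sketched below. It is not PREMA's actual predictive algorithm; the quantum and priority scheme are assumptions for illustration:

```python
import heapq
from dataclasses import dataclass, field

@dataclass(order=True)
class Task:
    priority: int                            # lower value = more urgent
    name: str = field(compare=False)
    remaining: int = field(compare=False)    # predicted cycles left

def run(tasks, quantum=100):
    """Toy preemptible-NPU scheduler: always run the most urgent task,
    checkpointing and requeueing anything that is not yet finished."""
    ready = list(tasks)
    heapq.heapify(ready)
    clock = 0
    while ready:
        task = heapq.heappop(ready)
        step = min(quantum, task.remaining)  # run one preemption quantum
        clock += step
        task.remaining -= step
        if task.remaining > 0:
            heapq.heappush(ready, task)      # preempt: save state, requeue
        else:
            print(f"{task.name} finished at cycle {clock}")

run([Task(2, "batch-job", 300), Task(0, "interactive", 150)])
```

Here the low-latency "interactive" task runs to completion first while the "batch-job" is repeatedly preempted, trading some throughput for priority latency.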
DCNN for Tactile Sensory Data Classification based on Transfer Learning
TLDR: This framework demonstrates a method for touch-modality classification that uses pre-trained convolutional neural networks (CNNs) to address the challenging task of recognizing which object was touched by the E-Skin.
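For context, transfer learning of the kind the summary describes typically freezes a pre-trained backbone and trains only a new classification head. A minimal PyTorch sketch, with a made-up class count and no claim to match the paper's setup:

```python
import torch.nn as nn
from torchvision import models

# Hypothetical setup: reuse ImageNet features for tactile inputs; the
# six touch-modality classes are illustrative, not the paper's.
backbone = models.resnet18(weights="IMAGENET1K_V1")
for p in backbone.parameters():
    p.requires_grad = False                               # freeze pre-trained features
backbone.fc = nn.Linear(backbone.fc.in_features, 6)       # new trainable head
```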

References

Showing 1-10 of 28 references
YodaNN: An Architecture for Ultralow Power Binary-Weight CNN Acceleration
TLDR: This paper presents an accelerator optimized for binary-weight CNNs that significantly outperforms the state of the art in energy and area efficiency; binary weights remove the need for expensive multiplications and reduce I/O bandwidth and storage.
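The core trick behind binary-weight accelerators like YodaNN is that with weights constrained to ±1, every multiply-accumulate collapses to an add or subtract. A minimal numpy sketch of that arithmetic identity (my formulation, not YodaNN's hardware):

```python
import numpy as np

def binary_weight_dot(x, w_sign):
    # With w in {-1, +1}, x . w = sum(x where w=+1) - sum(x where w=-1):
    # no multiplier is needed, only adders.
    return x[w_sign > 0].sum() - x[w_sign < 0].sum()

x = np.random.randn(64).astype(np.float32)
w = np.sign(np.random.randn(64)).astype(np.float32)
assert np.isclose(binary_weight_dot(x, w), x @ w)
```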
Scalable and modularized RTL compilation of Convolutional Neural Networks onto FPGA
TLDR: This work quantitatively analyzes the compiler's design strategy for optimizing the throughput of a given CNN model under FPGA resource constraints, and demonstrates the promise of an automatic compiler solution for modularized and scalable hardware acceleration of deep learning.
ISAAC: A Convolutional Neural Network Accelerator with In-Situ Analog Arithmetic in Crossbars
TLDR: This work explores an in-situ processing approach in which memristor crossbar arrays not only store input weights but also perform dot-product operations in an analog manner.
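The in-situ analog arithmetic can be modeled in one line: with weights stored as conductances G and inputs applied as row voltages V, Ohm's law gives a current per cell and Kirchhoff's current law sums them, so each column current is one dot product. An idealized sketch that ignores device non-idealities (all values illustrative):

```python
import numpy as np

G = np.random.uniform(1e-6, 1e-4, size=(128, 16))  # conductances (S), 128 rows x 16 columns
V = np.random.uniform(0.0, 0.5, size=128)          # input voltages per row
I_columns = V @ G   # column currents (A) = analog dot products
```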
Quantized Convolutional Neural Networks for Mobile Devices
TLDR: This paper proposes an efficient framework, Quantized CNN, to simultaneously speed up computation and reduce the storage and memory overhead of CNN models.
DianNao: a small-footprint high-throughput accelerator for ubiquitous machine-learning
TLDR: This study designs an accelerator for large-scale CNNs and DNNs, with special emphasis on the impact of memory on accelerator design, performance, and energy, and shows that it is possible to design a high-throughput accelerator capable of 452 GOP/s in a small footprint.
EIE: Efficient Inference Engine on Compressed Deep Neural Network
  • Song Han, Xingyu Liu, +4 authors W. Dally
  • Computer Science
  • 2016 ACM/IEEE 43rd Annual International Symposium on Computer Architecture (ISCA)
  • 2016
TLDR: An energy-efficient inference engine (EIE) performs inference directly on the compressed network model, accelerating the resulting sparse matrix-vector multiplication with weight sharing; it is 189x and 13x faster than CPU and GPU implementations of the same DNN without compression.
Improving the speed of neural networks on CPUs
TLDR: Using speech recognition as an example task, this paper shows that a real-time hybrid hidden Markov model / neural network (HMM/NN) large-vocabulary system can be built with a 10× speedup over an unoptimized baseline and a 4× speedup over an aggressively optimized floating-point baseline, at no cost in accuracy.
Deep Convolutional Neural Network Inference with Floating-point Weights and Fixed-point Activations
TLDR: It is shown that using floating-point numbers for weights is more efficient than a fixed-point representation at the same bit-width, and enables a compact hardware multiply-and-accumulate (MAC) unit design.
Training deep neural networks with low precision multiplications
TLDR: It is found that very low precision is sufficient not just for running trained networks but also for training them; it is possible to train Maxout networks with 10-bit multiplications.
Fixed Point Quantization of Deep Convolutional Networks
TLDR: This paper proposes a quantizer design for fixed-point implementation of DCNs, formulates and solves an optimization problem to identify the optimal fixed-point bit-width allocation across DCN layers, and demonstrates that fine-tuning can further enhance the accuracy of fixed-point DCNs beyond that of the original floating-point model.
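To make the bit-width allocation problem concrete, the sketch below implements a generic symmetric uniform quantizer and reports per-layer SQNR at different bit-widths, the kind of signal a search over allocations could use. It is not the paper's quantizer or optimization; layer names and weight distributions are invented:

```python
import numpy as np

def quantize_uniform(w, bits):
    """Symmetric uniform quantizer: snap each weight to one of 2**bits
    evenly spaced levels spanning [-max|w|, +max|w|]."""
    scale = np.abs(w).max() / (2 ** (bits - 1) - 1)
    q = np.round(w / scale).clip(-2 ** (bits - 1), 2 ** (bits - 1) - 1)
    return q * scale

rng = np.random.default_rng(0)
layers = {"conv1": rng.normal(0.0, 1.0, 1000), "fc": rng.normal(0.0, 0.1, 1000)}
for name, w in layers.items():
    for bits in (4, 8):
        err = w - quantize_uniform(w, bits)
        sqnr = 10 * np.log10(np.mean(w ** 2) / np.mean(err ** 2))
        print(f"{name} @ {bits} bits: SQNR = {sqnr:.1f} dB")
```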