• Publications
Throughput-Optimized OpenCL-based FPGA Accelerator for Large-Scale Convolutional Neural Networks
TLDR
This work presents a systematic design space exploration methodology to maximize the throughput of an OpenCL-based FPGA accelerator for a given CNN model, considering the FPGA's resource constraints such as on-chip memory, registers, computational resources, and external memory bandwidth.
A 45nm CMOS neuromorphic chip with a scalable architecture for learning in networks of spiking neurons
TLDR
A new architecture is proposed to enable scalable learning algorithms for networks of spiking neurons in silicon by combining innovations in computation, memory, and communication to leverage robust digital neuron circuits and novel transposable SRAM arrays.
Optimizing Loop Operation and Dataflow in FPGA Acceleration of Deep Convolutional Neural Networks
TLDR
This work systematically explores the trade-offs of hardware cost by searching the design variable configurations, and proposes a specific dataflow of hardware CNN acceleration to minimize memory access and data movement while maximizing resource utilization to achieve high performance.
XNOR-RRAM: A scalable and parallel resistive synaptic architecture for binary neural networks
TLDR
This work proposes an RRAM synaptic architecture with a bit-cell design of complementary word lines that implements equivalent XNOR and bit-counting operations in a parallel fashion, and investigates the impact of sensing offsets on classification accuracy while analyzing various design options with different sub-array sizes and sensing bit-levels.
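The XNOR-and-bit-count operation this entry refers to is the standard binary-neural-network dot product: with weights and activations constrained to ±1, the dot product reduces to a XNOR followed by a population count. A minimal sketch of that arithmetic identity (the function and packing helper below are illustrative names, not from the paper):

```python
def pack_signs(v):
    """Pack a ±1 vector into an integer: element i -> bit i, +1 -> 1, -1 -> 0."""
    return sum(1 << i for i, x in enumerate(v) if x == +1)

def xnor_popcount_dot(a_bits, w_bits, n):
    """Dot product of two ±1 vectors of length n, packed as integers.

    Matching sign bits contribute +1, differing bits contribute -1, so
    a . w = n - 2 * popcount(a XOR w).
    """
    mask = (1 << n) - 1
    differing = (a_bits ^ w_bits) & mask  # 1 wherever the signs differ
    return n - 2 * bin(differing).count("1")

a = [+1, -1, +1, +1]
w = [+1, +1, -1, +1]
dot = xnor_popcount_dot(pack_signs(a), pack_signs(w), len(a))
# Equals the ordinary dot product: (+1)(+1) + (-1)(+1) + (+1)(-1) + (+1)(+1) = 0
```

In-memory implementations such as the architecture above evaluate this sum on the array periphery in analog fashion rather than bit-serially, which is where the parallelism comes from.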
Mitigating effects of non-ideal synaptic device characteristics for on-chip learning
TLDR
This study shows that the recognition accuracy of MNIST handwritten digits degrades from ~97% to ~65%, and proposes mitigation strategies, which include smart programming schemes for achieving linear weight update, a dummy column to eliminate the off-state current, and the use of multiple cells per weight element to alleviate the impact of device variations.
An automatic RTL compiler for high-throughput FPGA implementation of diverse deep convolutional neural networks
TLDR
This work presents an RTL-level CNN compiler that automatically generates customized FPGA hardware for the inference tasks of various CNNs, in order to enable high-level fast prototyping of CNNs from software to FPGAs and still keep the benefits of low-level hardware optimization.
XNOR-SRAM: In-Memory Computing SRAM Macro for Binary/Ternary Deep Neural Networks
We present an in-memory computing SRAM macro that computes XNOR-and-accumulate in binary/ternary deep neural networks on the bitline without row-by-row data access. It achieves 33X better energy and
Optimizing the Convolution Operation to Accelerate Deep Neural Networks on FPGA
TLDR
This paper quantitatively analyzes and optimizes the design objectives of the CNN accelerator based on multiple design variables, and proposes a specific dataflow of hardware CNN acceleration to minimize data communication while maximizing resource utilization to achieve high performance.
Scalable and modularized RTL compilation of Convolutional Neural Networks onto FPGA
TLDR
This work quantitatively analyzes the compiler's design strategy to optimize the throughput of a given CNN model under the FPGA resource constraints, and demonstrates the promise of the automatic compiler solution for modularized and scalable hardware acceleration of deep learning.
Specifications of Nanoscale Devices and Circuits for Neuromorphic Computational Systems
TLDR
It is shown that neuromorphic systems based on new nanoscale devices can potentially improve density and power consumption by at least a factor of 10, as compared with conventional CMOS implementations.
...