Throughput-Optimized OpenCL-based FPGA Accelerator for Large-Scale Convolutional Neural Networks
Convolutional Neural Networks (CNNs) have gained popularity in many computer vision applications such as image classification, face detection, and video analysis, because of their ability to train and classify with high accuracy.
A 45nm CMOS neuromorphic chip with a scalable architecture for learning in networks of spiking neurons
In this paper, we demonstrate a highly configurable neuromorphic chip with integrated learning for use in pattern classification, recognition, and associative memory tasks.
Optimizing Loop Operation and Dataflow in FPGA Acceleration of Deep Convolutional Neural Networks
We quantitatively analyze and optimize the design objectives of a hardware CNN accelerator over multiple design variables to minimize memory access and data movement while maximizing resource utilization.
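The loop-tiling idea behind this dataflow analysis can be sketched in a few lines. This is an illustrative software model, not the paper's hardware design: tile sizes `Tm` (output channels) and `Tr` (output rows) are hypothetical parameters standing in for the design variables the paper optimizes.

```python
import numpy as np

def conv_tiled(ifmap, weights, Tm=2, Tr=4):
    """Tiled 2-D convolution (stride 1, no padding).

    ifmap:   (N_in, H, W) input feature maps
    weights: (N_out, N_in, K, K) kernels
    Tiling output channels (Tm) and output rows (Tr) bounds each
    tile's working set so it can fit in on-chip buffers, which is
    what reduces off-chip memory accesses in an FPGA accelerator.
    """
    N_out, N_in, K, _ = weights.shape
    _, H, W = ifmap.shape
    Ho, Wo = H - K + 1, W - K + 1
    ofmap = np.zeros((N_out, Ho, Wo))
    for m0 in range(0, N_out, Tm):          # outer loops walk tiles
        for r0 in range(0, Ho, Tr):
            for m in range(m0, min(m0 + Tm, N_out)):  # inner loops stay
                for r in range(r0, min(r0 + Tr, Ho)): # inside one tile
                    for c in range(Wo):
                        ofmap[m, r, c] = np.sum(
                            ifmap[:, r:r + K, c:c + K] * weights[m])
    return ofmap
```

The result is identical to an untiled convolution; only the loop order and blocking change, which is exactly the degree of freedom the paper's design-space exploration exercises.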
XNOR-RRAM: A scalable and parallel resistive synaptic architecture for binary neural networks
We propose an RRAM synaptic architecture (XNOR-RRAM) with a bit-cell design of complementary word lines that implements equivalent XNOR and bit-counting operations in a parallel fashion.
Mitigating effects of non-ideal synaptic device characteristics for on-chip learning
The cross-point array architecture with resistive synaptic devices has been proposed for on-chip implementation of weighted sum and weight update in the training process of learning algorithms.
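The two crossbar operations named above, and the kind of non-ideality the paper mitigates, can be sketched numerically. This is a first-order model under stated assumptions: the exponential nonlinearity factor `alpha` and the conductance bounds are illustrative, not the device characteristics measured in the paper.

```python
import numpy as np

def crossbar_weighted_sum(G, v):
    # Analog multiply-accumulate: each column current is the sum of
    # conductance * voltage along that column (Ohm's + Kirchhoff's laws),
    # so a whole matrix-vector product completes in one read step.
    return G.T @ v

def nonideal_update(G, dW, alpha=2.0, Gmin=0.0, Gmax=1.0):
    # Nonlinear weight update (assumed model): the effective conductance
    # step shrinks exponentially as the device approaches Gmax during
    # potentiation or Gmin during depression, so repeated identical
    # pulses do not produce identical weight changes.
    step = np.where(dW > 0,
                    dW * np.exp(-alpha * (G - Gmin) / (Gmax - Gmin)),
                    dW * np.exp(-alpha * (Gmax - G) / (Gmax - Gmin)))
    return np.clip(G + step, Gmin, Gmax)
```

An ideal device would apply `dW` directly; the gap between `nonideal_update` and that ideal step is what degrades on-chip learning accuracy and motivates the mitigation techniques in the paper.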
An automatic RTL compiler for high-throughput FPGA implementation of diverse deep convolutional neural networks
We present an RTL-level CNN compiler that automatically generates customized FPGA hardware for the inference tasks of various CNNs, in order to enable high-level fast prototyping of CNNs from software to FPGA while keeping the benefits of low-level hardware optimization.
XNOR-SRAM: In-Memory Computing SRAM Macro for Binary/Ternary Deep Neural Networks
We present an in-memory computing SRAM macro that computes XNOR-and-accumulate in binary/ternary deep neural networks on the bitline without row-by-row data access.
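The XNOR-and-accumulate primitive that the macro evaluates on the bitline has a compact software equivalent: with binary weights and activations encoded as bits (+1 → 1, −1 → 0), a dot product reduces to XNOR plus a population count. This sketch only illustrates the arithmetic identity; the function names are ours, and the macro itself performs the accumulation in analog on the bitline rather than with a digital popcount.

```python
def pack(bits):
    # Encode a {-1, +1} vector as a bitmask: +1 -> 1, -1 -> 0.
    word = 0
    for i, b in enumerate(bits):
        if b == +1:
            word |= 1 << i
    return word

def xnor_accumulate(wa, wb, n):
    # XNOR marks the positions where the two signs agree; each agreement
    # contributes +1 to the dot product and each disagreement -1, so
    # dot = matches - (n - matches) = 2 * matches - n.
    matches = bin(~(wa ^ wb) & ((1 << n) - 1)).count("1")
    return 2 * matches - n

a = [+1, -1, +1, +1]
b = [+1, +1, -1, +1]
# dot(a, b) = 1 - 1 - 1 + 1 = 0
print(xnor_accumulate(pack(a), pack(b), len(a)))
```

Because every bit position is combined in one XNOR-popcount step, the whole multiply-accumulate can happen in parallel, which is what removes the row-by-row access of a conventional SRAM read.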
Optimizing the Convolution Operation to Accelerate Deep Neural Networks on FPGA
We quantitatively analyze and optimize the design objectives (e.g., loop optimization) and dataflow of the CNN accelerator based on multiple design variables.
Scalable and modularized RTL compilation of Convolutional Neural Networks onto FPGA
In this work, we present a scalable solution that integrates the flexibility of high-level synthesis with the finer-grained optimization of an RTL implementation for end-to-end CNN implementations.
Fully parallel write/read in resistive synaptic array for accelerating on-chip learning.
A neuro-inspired computing paradigm beyond the von Neumann architecture is emerging; it generally takes advantage of massive parallelism and is aimed at complex tasks that involve intelligence and …