ISAAC: A Convolutional Neural Network Accelerator with In-Situ Analog Arithmetic in Crossbars

Ali Shafiee, Anirban Nag, Naveen Muralimanohar, Rajeev Balasubramonian, John Paul Strachan, Miao Hu, R. Stanley Williams, Vivek Srikumar
2016 ACM/IEEE 43rd Annual International Symposium on Computer Architecture (ISCA)

A number of recent efforts have attempted to design accelerators for popular machine learning algorithms, such as those involving convolutional and deep neural networks (CNNs and DNNs). These algorithms typically involve a large number of multiply-accumulate (dot-product) operations. A recent project, DaDianNao, adopts a near data processing approach, where a specialized neural functional unit performs all the digital arithmetic operations and receives input weights from adjacent eDRAM banks… 
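
As a rough illustration of why these workloads are dominated by multiply-accumulate operations, the following minimal Python sketch (not code from the paper) computes a small convolution as one dot product per output element. The function names `dot_product` and `conv2d_valid`, the list-of-lists data layout, and the toy values are assumptions made for this example; the sketch models ordinary digital arithmetic and does not represent how ISAAC or DaDianNao carry out the computation.

```python
def dot_product(inputs, weights):
    """Multiply-accumulate: sum of element-wise products of two equal-length lists."""
    if len(inputs) != len(weights):
        raise ValueError("inputs and weights must have the same length")
    acc = 0
    for x, w in zip(inputs, weights):
        acc += x * w  # one multiply-accumulate (MAC) step
    return acc


def conv2d_valid(image, kernel):
    """Slide `kernel` over `image` without padding (stride 1).

    Uses the cross-correlation convention common in CNNs: each output
    element is a single dot product between the flattened kernel and
    the flattened input window underneath it.
    """
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    flat_kernel = [w for row in kernel for w in row]
    output = []
    for r in range(out_h):
        out_row = []
        for c in range(out_w):
            # Flatten the window in the same row-major order as the kernel.
            window = [image[r + i][c + j] for i in range(kh) for j in range(kw)]
            out_row.append(dot_product(window, flat_kernel))
        output.append(out_row)
    return output


if __name__ == "__main__":
    # Hypothetical toy data: a 3x3 input and a 2x2 kernel.
    image = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
    kernel = [[1, 0], [0, 1]]
    print(conv2d_valid(image, kernel))
```

Even this 3x3 input with a 2x2 kernel needs four dot products of four MACs each; real layers with many channels and filters repeat the same pattern on a far larger scale, which is what motivates dedicated dot-product hardware.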

A Versatile ReRAM-based Accelerator for Convolutional Neural Networks

This work proposes a multi-tile ReRAM accelerator for supporting multiple CNN topologies, where each tile processes one or more layers in a pipelined fashion, and designs every tile with 9 processing elements that operate in a systolic fashion.

Trained Biased Number Representation for ReRAM-Based Neural Network Accelerators

A new CNN training and implementation approach that implements weights using a trained biased number representation, which can achieve near full-precision model accuracy with as little as 2-bit weights and 2-bit activations on the CIFAR datasets.

ATRIA: A Bit-Parallel Stochastic Arithmetic Based Accelerator for In-DRAM CNN Processing

ATRIA significantly improves the latency, throughput, and efficiency of processing CNN inferences by performing 16 MAC operations in only five consecutive memory operation cycles, compared to the best-performing in-DRAM accelerator from prior work.

PipeLayer: A Pipelined ReRAM-Based Accelerator for Deep Learning

PipeLayer is presented, a ReRAM-based PIM accelerator for CNNs that supports both training and testing. It proposes a highly parallel design based on the notion of parallelism granularity and weight replication, which enables highly pipelined execution of both training and testing without introducing the potential stalls of previous work.

Input-Splitting of Large Neural Networks for Power-Efficient Accelerator with Resistive Crossbar Memory Array

It is demonstrated that any CNN model can be represented with multiple arrays without using intermediate partial sums, and the ADC power of the proposed design is 32x smaller and the total chip power is 3x smaller than those of the baseline design.

Analog Weights in ReRAM DNN Accelerators

This paper presents a novel scheme for alleviating the single-bit-per-device restriction by exploiting the frequency dependence of v-i plane hysteresis, assigning kernel information not only to the device conductance but also partially distributing it to the frequency of a time-varying input.

PANTHER: A Programmable Architecture for Neural Network Training Harnessing Energy-Efficient ReRAM

PANTHER, an ISA-programmable training accelerator with compiler support, is developed and can be integrated into other accelerators in the literature to enhance their efficiency.

Design and Optimization of Hardware Accelerators for Deep Learning

This dissertation proposes two hardware units, ISAAC and Newton, and shows that in-situ computing designs can outperform DNN digital accelerators, if they leverage pipelining, smart encodings, and can distribute a computation in time and space, within crossbars, and across crossbars.

ODIN: A Bit-Parallel Stochastic Arithmetic Based Accelerator for In-Situ Neural Network Processing in Phase Change RAM

A novel processing-in-memory (PIM) engine called ODIN is presented that employs hybrid binary-stochastic bit-parallel arithmetic inside phase change RAM (PCRAM) to enable a low-overhead in-situ acceleration of all essential ANN functions such as multiply-accumulate (MAC), nonlinear activation, and pooling.

MAX2: An ReRAM-Based Neural Network Accelerator That Maximizes Data Reuse and Area Utilization

A multi-tile ReRAM accelerator framework for supporting multiple CNN topologies that maximizes on-chip data reuse and reduces on-chip bandwidth to minimize energy consumption due to data movement, together with a detailed energy and area breakdown of each component at the PE level, tile level, and system level.



Accelerating Deep Convolutional Neural Networks Using Specialized Hardware

Hardware specialization in the form of GPGPUs, FPGAs, and ASICs offers a promising path towards major leaps in processing capability while achieving high energy efficiency, and combining multiple FPGAs over a low-latency communication fabric offers further opportunity to train and evaluate models of unprecedented size and quality.

PRIME: A Novel Processing-in-Memory Architecture for Neural Network Computation in ReRAM-Based Main Memory

  • Ping Chi, Shuangchen Li, …, Yuan Xie
  • Computer Science
    2016 ACM/IEEE 43rd Annual International Symposium on Computer Architecture (ISCA)
  • 2016
This work proposes a novel PIM architecture, called PRIME, to accelerate NN applications in ReRAM based main memory, and distinguishes itself from prior work on NN acceleration, with significant performance improvement and energy saving.

DaDianNao: A Machine-Learning Supercomputer

  • Yunji Chen, Tao Luo, …, O. Temam
  • Computer Science
    2014 47th Annual IEEE/ACM International Symposium on Microarchitecture
  • 2014
This article introduces a custom multi-chip machine-learning architecture, showing that, on a subset of the largest known neural network layers, it is possible to achieve a speedup of 450.65x over a GPU, and reduce the energy by 150.31x on average for a 64-chip system.

ShiDianNao: Shifting vision processing closer to the sensor

This paper proposes an accelerator which is 60x more energy efficient than the previous state-of-the-art neural network accelerator, designed down to the layout at 65 nm, with a modest footprint and consuming only 320 mW, but still about 30x faster than high-end GPUs.

Dot-product engine for neuromorphic computing: Programming 1T1M crossbar to accelerate matrix-vector multiplication

The Dot-Product Engine (DPE) is developed as a high-density, high-power-efficiency accelerator for approximate matrix-vector multiplication, along with a conversion algorithm that maps arbitrary matrix values appropriately to memristor conductances in a realistic crossbar array.

CNP: An FPGA-based processor for Convolutional Networks

The implementation exploits the inherent parallelism of ConvNets and takes full advantage of multiple hardware multiply-accumulate units on the FPGA, and can be used for low-power, lightweight embedded vision systems for micro-UAVs and other small robots.

EIE: Efficient Inference Engine on Compressed Deep Neural Network

  • Song Han, Xingyu Liu, …, W. Dally
  • Computer Science
    2016 ACM/IEEE 43rd Annual International Symposium on Computer Architecture (ISCA)
  • 2016
An energy-efficient inference engine (EIE) is proposed that performs inference on a compressed network model and accelerates the resulting sparse matrix-vector multiplication with weight sharing; it is 189x and 13x faster than CPU and GPU implementations, respectively, of the same DNN without compression.

Minerva: Enabling Low-Power, Highly-Accurate Deep Neural Network Accelerators

The continued success of Deep Neural Networks (DNNs) in classification tasks has sparked a trend of accelerating their execution with specialized hardware. While published designs easily give an

Origami: A Convolutional Network Accelerator

This paper presents the first convolutional network accelerator which is scalable to network sizes that are currently only handled by workstation GPUs, but remains within the power envelope of embedded systems.

DianNao: a small-footprint high-throughput accelerator for ubiquitous machine-learning

This study designs an accelerator for large-scale CNNs and DNNs, with a special emphasis on the impact of memory on accelerator design, performance and energy, and shows that it is possible to design an accelerator with a high throughput, capable of performing 452 GOP/s in a small footprint.