YodaNN: An Architecture for Ultralow Power Binary-Weight CNN Acceleration

@article{Andri2018YodaNNAA,
  title={YodaNN: An Architecture for Ultralow Power Binary-Weight CNN Acceleration},
  author={Renzo Andri and Lukas Cavigelli and Davide Rossi and Luca Benini},
  journal={IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems},
  year={2018},
  volume={37},
  pages={48-60}
}
  • Renzo Andri, L. Cavigelli, D. Rossi, L. Benini
  • Published 17 June 2016
  • Computer Science
  • IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems
Convolutional neural networks (CNNs) have revolutionized the world of computer vision over the last few years, pushing image classification beyond human accuracy. The computational effort of today’s CNNs requires power-hungry parallel processors or GP-GPUs. Recent developments in CNN accelerators for system-on-chip integration have reduced energy consumption significantly. Unfortunately, even these highly optimized devices are above the power envelope imposed by mobile and deeply embedded… 
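The key property that YodaNN and the binary-weight accelerators below exploit is that with weights constrained to {−1, +1}, every multiply-accumulate in a convolution degenerates into a conditional add or subtract, plus one real-valued scale per filter. A minimal NumPy sketch of this idea (array names and the per-filter scale alpha are illustrative, not taken from the paper):

```python
import numpy as np

def binary_weight_conv2d(x, w_bin, alpha):
    """2-D convolution with binary weights.

    x     : (H, W) input feature map
    w_bin : (K, K) filter with entries in {-1, +1}
    alpha : real-valued per-filter scaling factor
    Every multiply reduces to adding or subtracting an activation.
    """
    H, W = x.shape
    K = w_bin.shape[0]
    out = np.zeros((H - K + 1, W - K + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            patch = x[i:i + K, j:j + K]
            # +1 weights contribute additions, -1 weights subtractions:
            out[i, j] = alpha * (patch[w_bin > 0].sum() - patch[w_bin < 0].sum())
    return out

x = np.random.randn(8, 8)
w = np.sign(np.random.randn(3, 3))          # binarized filter
print(binary_weight_conv2d(x, w, alpha=0.3).shape)   # (6, 6)
```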
Hyperdrive: A Multi-Chip Systolically Scalable Binary-Weight CNN Inference Engine
TLDR
Hyperdrive is presented: a BWN accelerator that dramatically reduces I/O bandwidth by exploiting a novel binary-weight streaming approach and that can be used with arbitrarily sized convolutional neural network architectures and input resolutions by exploiting the natural scalability of its compute units at both chip and system level.
XNORBIN: A 95 TOp/s/W hardware accelerator for binary convolutional neural networks
TLDR
XNORBIN is presented, a flexible accelerator for binary CNNs that tightly couples computation to memory for aggressive data reuse and supports even non-trivial network topologies with large feature-map volumes.
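When activations are binarized as well (as in the fully binary CNNs XNORBIN targets), a dot product over {−1, +1} vectors reduces to an XNOR followed by a popcount. A hedged sketch of that identity, without the bit packing a real accelerator would use:

```python
import numpy as np

def xnor_popcount_dot(a_bits, w_bits):
    """Dot product of two {-1,+1} vectors encoded as {0,1} bits.

    With n = len(a_bits), the identity is:
        dot = 2 * popcount(XNOR(a, w)) - n
    """
    n = len(a_bits)
    xnor = ~(a_bits ^ w_bits) & 1            # 1 where the two bits agree
    return 2 * int(xnor.sum()) - n

a = np.random.randint(0, 2, 64)              # activations as bits (0 -> -1, 1 -> +1)
w = np.random.randint(0, 2, 64)              # weights as bits
ref = int(np.dot(2 * a - 1, 2 * w - 1))      # reference dot product in the {-1,+1} domain
assert xnor_popcount_dot(a, w) == ref
print(ref)
```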
Hyperdrive: A Systolically Scalable Binary-Weight CNN Inference Engine for mW IoT End-Nodes
TLDR
Hyperdrive is presented: a BWN accelerator that dramatically reduces I/O bandwidth by exploiting a novel binary-weight streaming approach and that can handle high-resolution images by virtue of its systolically scalable architecture.
ChewBaccaNN: A Flexible 223 TOPS/W BNN Accelerator
TLDR
ChewBaccaNN is presented, a 0.7 mm² binary convolutional neural network (CNN) accelerator designed in GlobalFoundries 22 nm technology that performs CIFAR-10 inference at 86.8% accuracy and runs inference on a binarized ResNet-18 trained with 8-bases Group-Net, achieving 67.5% Top-1 accuracy.
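The 8-bases Group-Net result mentioned above relies on approximating full-precision tensors with a weighted sum of several binary bases. The sketch below uses a simple greedy residual binarization to illustrate the general multi-base idea; it is not the actual Group-Net decomposition:

```python
import numpy as np

def residual_binarize(w, num_bases=8):
    """Approximate w with sum_k alpha_k * B_k, where each B_k is in {-1,+1}.

    Greedy residual scheme: each base binarizes whatever the previous
    bases have not yet explained. Illustrative only; Group-Net uses a
    structured decomposition rather than this simple greedy fit.
    """
    residual = w.astype(float).copy()
    bases, alphas = [], []
    for _ in range(num_bases):
        b = np.where(residual >= 0, 1.0, -1.0)
        alpha = np.abs(residual).mean()       # least-squares optimal scale for sign(residual)
        bases.append(b)
        alphas.append(alpha)
        residual -= alpha * b
    approx = sum(a * b for a, b in zip(alphas, bases))
    return approx, bases, alphas

w = np.random.randn(3, 3, 64, 64)             # a hypothetical conv weight tensor
approx, _, _ = residual_binarize(w, num_bases=8)
print(np.abs(w - approx).mean())               # error shrinks as more bases are used
```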
Towards energy-efficient convolutional neural network inference
TLDR
This thesis first evaluates the capabilities of off-the-shelf software-programmable hardware before diving into specialized hardware accelerators and exploring the potential of extremely quantized CNNs, and gives special consideration to external memory bandwidth.
IMC: energy-efficient in-memory convolver for accelerating binarized deep neural network
TLDR
A novel concept of an in-memory convolver (IMC) is proposed that implements the dominant convolution computation within main memory, based on the authors' Spin-Orbit Torque Magnetic Random Access Memory (SOT-MRAM) array architecture, to greatly reduce data communication and thus accelerate binary CNNs (BCNNs).
An Efficient Streaming Accelerator for Low Bit-Width Convolutional Neural Networks
TLDR
An efficient, scalable accelerator for low bit-width CNNs is proposed, based on a parallel streaming architecture with a novel coarse-grained task-partitioning strategy; on average, it can nearly double the throughput for various CNN models.
COSY: An Energy-Efficient Hardware Architecture for Deep Convolutional Neural Networks Based on Systolic Array
  • Chen Xin, Qiang Chen, Bo Wang
  • Computer Science
    2017 IEEE 23rd International Conference on Parallel and Distributed Systems (ICPADS)
  • 2017
TLDR
COSY (CNN on Systolic Array) is presented, an energy-efficient hardware architecture for CNNs based on a systolic array; it achieves an over 15% reduction in energy consumption under the same constraints, and it is proved that COSY has an intrinsic ability for zero-skipping.
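Zero-skipping, as claimed for COSY, exploits the fact that post-ReLU feature maps are highly sparse, so multiply-accumulates with a zero activation can be skipped entirely. A toy sketch of the principle (the systolic-array dataflow itself is not modeled here):

```python
import numpy as np

def mac_with_zero_skipping(activations, weights):
    """Dot product that skips multiply-accumulates for zero activations.

    Returns the result and the number of MACs actually performed,
    which is what zero-skipping hardware saves energy on.
    """
    acc, macs = 0.0, 0
    for a, w in zip(activations, weights):
        if a == 0.0:                # ReLU outputs are frequently zero
            continue
        acc += a * w
        macs += 1
    return acc, macs

x = np.maximum(np.random.randn(256), 0.0)   # post-ReLU activations (~50% zeros)
w = np.random.randn(256)
result, macs = mac_with_zero_skipping(x, w)
print(result, f"{macs}/256 MACs executed")
```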
Deep Neural Network Acceleration in Non-Volatile Memory: A Digital Approach
TLDR
These findings show significant optimization opportunities for replacing computationally intensive convolution operations (based on multiplication) with more efficient and less complex operations such as addition, tackling the DNN power and memory-wall bottlenecks.
EBPC: Extended Bit-Plane Compression for Deep Neural Network Inference and Training Accelerators
TLDR
This work introduces and evaluates a novel, hardware-friendly, and lossless compression scheme for the feature maps present within convolutional neural networks, and achieves compression factors for gradient map compression during training that are even better than for inference.
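Feature-map compressors like EBPC exploit the same post-ReLU sparsity to cut external memory bandwidth. The toy encoder below only illustrates the zero-run-length idea on which such schemes build; it is not the actual EBPC bit-plane format:

```python
import numpy as np

def zero_rle_encode(values):
    """Toy lossless encoder: emit (zero_run_length, nonzero_value) pairs."""
    pairs, run = [], 0
    for v in values:
        if v == 0:
            run += 1
        else:
            pairs.append((run, int(v)))
            run = 0
    if run:
        pairs.append((run, None))             # trailing zeros
    return pairs

def zero_rle_decode(pairs):
    out = []
    for run, v in pairs:
        out.extend([0] * run)
        if v is not None:
            out.append(v)
    return out

fmap = np.maximum(np.random.randn(64) * 3, 0).astype(int)   # sparse post-ReLU map
enc = zero_rle_encode(fmap)
assert zero_rle_decode(enc) == fmap.tolist()                  # lossless round-trip
print(len(enc), "symbols for", len(fmap), "values")
```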
...

References

SHOWING 1-10 OF 70 REFERENCES
A heterogeneous multi-core system-on-chip for energy efficient brain inspired vision
TLDR
This work proposes a 65 nm system-on-chip implementing a hybrid HW/SW CNN accelerator that meets this energy-efficiency target, with a near-threshold parallel processor cluster and a hardware accelerator for the convolution-accumulation operations that constitute the basic kernel of CNNs.
ShiDianNao: Shifting vision processing closer to the sensor
TLDR
This paper proposes an accelerator which is 60x more energy efficient than the previous state-of-the-art neural network accelerator, designed down to the layout at 65 nm, with a modest footprint and consuming only 320 mW, but still about 30x faster than high-end GPUs.
Origami: A 803-GOp/s/W Convolutional Network Accelerator
  • L. Cavigelli, L. Benini
  • Computer Science
    IEEE Transactions on Circuits and Systems for Video Technology
  • 2017
TLDR
A new architecture, design, and implementation of such an accelerator are presented, together with the first reported silicon measurements, outperforming previous work in terms of power, area, and I/O efficiency.
Origami: A Convolutional Network Accelerator
TLDR
This paper presents the first convolutional network accelerator which is scalable to network sizes that are currently only handled by workstation GPUs, but remains within the power envelope of embedded systems.
Accelerating Deep Convolutional Neural Networks Using Specialized Hardware
TLDR
Hardware specialization in the form of GPGPUs, FPGAs, and ASICs offers a promising path towards major leaps in processing capability while achieving high energy efficiency, and combining multiple FPGAs over a low-latency communication fabric offers further opportunity to train and evaluate models of unprecedented size and quality.
Minerva: Enabling Low-Power, Highly-Accurate Deep Neural Network Accelerators
The continued success of Deep Neural Networks (DNNs) in classification tasks has sparked a trend of accelerating their execution with specialized hardware. While published designs easily give an…
ISAAC: A Convolutional Neural Network Accelerator with In-Situ Analog Arithmetic in Crossbars
TLDR
This work explores an in-situ processing approach, where memristor crossbar arrays not only store input weights, but are also used to perform dot-product operations in an analog manner.
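In a memristor crossbar such as ISAAC's, weights are stored as conductances and the dot product falls out of Ohm's and Kirchhoff's laws: each bit-line current is the sum of the input voltages weighted by the conductances in that column. A small numeric sketch of this analog MAC with ideal devices (no ADC/DAC or noise modeling):

```python
import numpy as np

# Weights stored as conductances G (siemens); inputs applied as voltages V (volts).
G = np.random.rand(4, 3) * 1e-6     # 4 word-lines x 3 bit-lines, idealized devices
V = np.random.rand(4)               # input voltages driven onto the word-lines

# Kirchhoff's current law: each bit-line current is the dot product V . G[:, j]
I = V @ G                           # shape (3,): one analog multiply-accumulate per column
print(I)
```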
Energy-efficient ConvNets through approximate computing
TLDR
Methods based on approximate computing to reduce energy consumption in state-of-the-art ConvNet accelerators are proposed and can save energy in the system's arithmetic: up to 30× without losing classification accuracy and more than 100× at 99% classification accuracy, compared to the commonly used 16-bit fixed-point number format.
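The 16-bit fixed-point baseline used in that comparison can be made concrete with a small quantization sketch; the Q-format parameters below are illustrative, not taken from the paper:

```python
import numpy as np

def to_fixed_point(x, frac_bits=12, total_bits=16):
    """Quantize to signed fixed-point with `frac_bits` fractional bits,
    then return the real value actually represented after rounding/clipping."""
    scale = 1 << frac_bits
    lo = -(1 << (total_bits - 1))
    hi = (1 << (total_bits - 1)) - 1
    q = np.clip(np.round(x * scale), lo, hi)
    return q / scale

w = np.random.randn(1000) * 0.5
w_q16 = to_fixed_point(w, frac_bits=12, total_bits=16)
w_q8 = to_fixed_point(w, frac_bits=5, total_bits=8)    # a more aggressive format
print(np.abs(w - w_q16).max(), np.abs(w - w_q8).max())  # quantization error grows as bits shrink
```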
A ultra-low-energy convolution engine for fast brain-inspired vision in multicore clusters
TLDR
This work proposes to augment many-core architectures using shared-memory clusters of power-optimized RISC processors with Hardware Convolution Engines (HWCEs): ultra-low energy coprocessors for accelerating convolutions, the main building block of many brain-inspired computer vision algorithms.
14.5 Eyeriss: An energy-efficient reconfigurable accelerator for deep convolutional neural networks
TLDR
To achieve state-of-the-art accuracy, CNNs with not only a larger number of layers but also millions of filter weights and varying shapes are needed, which results in substantial data movement that consumes significant energy.
...