Eyeriss: A Spatial Architecture for Energy-Efficient Dataflow for Convolutional Neural Networks

@inproceedings{Chen2016EyerissAS,
  title={Eyeriss: A Spatial Architecture for Energy-Efficient Dataflow for Convolutional Neural Networks},
  author={Yu-hsin Chen and Joel S. Emer and Vivienne Sze},
  booktitle={2016 ACM/IEEE 43rd Annual International Symposium on Computer Architecture (ISCA)},
  year={2016},
  pages={367-379}
}
  • Published 1 June 2016
Deep convolutional neural networks (CNNs) are widely used in modern AI systems for their superior accuracy but at the cost of high computational complexity. The complexity comes from the need to simultaneously process hundreds of filters and channels in the high-dimensional convolutions, which involve a significant amount of data movement. Although highly-parallel compute paradigms, such as SIMD/SIMT, effectively address the computation requirement to achieve high throughput, energy consumption… 
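The high-dimensional convolution the abstract refers to can be sketched as a direct loop nest. This is only an illustrative sketch (stride 1, no padding, with made-up dimension names M, C, R, S), not the Eyeriss dataflow itself:

```python
import numpy as np

def conv_layer(ifmap, weights):
    """Direct convolution: ifmap (C,H,W), weights (M,C,R,S) -> ofmap (M,E,F)."""
    C, H, W = ifmap.shape
    M, _, R, S = weights.shape
    E, F = H - R + 1, W - S + 1          # output size for stride 1, no padding
    ofmap = np.zeros((M, E, F))
    for m in range(M):                   # each of the M filters
        for e in range(E):
            for f in range(F):
                # one output point needs C*R*S multiply-accumulates, each
                # reading one filter weight and one input activation
                for c in range(C):
                    for r in range(R):
                        for s in range(S):
                            ofmap[m, e, f] += ifmap[c, e + r, f + s] * weights[m, c, r, s]
    return ofmap
```

Every output point costs C·R·S multiply-accumulates, and with hundreds of filters (M) and channels (C) the same input activations are re-read M times over; that repeated data movement, rather than the arithmetic itself, is the energy cost the paper's dataflow targets.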
A Kernel Unfolding Approach to Trade Data Movement with Computation Power for CNN Acceleration
TLDR
A kernel unfolding technique is proposed to eliminate duplicated feeding of the input feature map; at the same time, memory cells in the PIM array are highly utilized to reach peak computing throughput, memory bandwidth is used efficiently, and execution time is reduced significantly.
EcoFlow: Efficient Convolutional Dataflows for Low-Power Neural Network Accelerators
TLDR
EcoFlow enables flexible and high-performance transpose and dilated convolutions on architectures that are otherwise optimized for CNN inference, and the efficiency of its dataflows is evaluated on CNN training workloads and Generative Adversarial Network (GAN) training workloads.
CENNA: Cost-Effective Neural Network Accelerator
TLDR
A cost-effective neural network accelerator, named CENNA, reduces hardware cost by employing a cost-centric matrix multiplication that combines Strassen's multiplication with naive multiplication, which can minimize data movement.
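The summary names Strassen's multiplication without detail; for reference, one level of the classic Strassen recursion is sketched below (7 block multiplies instead of the naive 8). This is the textbook algorithm, not CENNA's actual cost-centric design:

```python
import numpy as np

def strassen_2x2_blocks(A, B):
    """One level of Strassen recursion on even-sized square matrices:
    computes A @ B with 7 block multiplies instead of the naive 8."""
    n = A.shape[0] // 2
    A11, A12, A21, A22 = A[:n, :n], A[:n, n:], A[n:, :n], A[n:, n:]
    B11, B12, B21, B22 = B[:n, :n], B[:n, n:], B[n:, :n], B[n:, n:]
    # the seven Strassen products
    M1 = (A11 + A22) @ (B11 + B22)
    M2 = (A21 + A22) @ B11
    M3 = A11 @ (B12 - B22)
    M4 = A22 @ (B21 - B11)
    M5 = (A11 + A12) @ B22
    M6 = (A21 - A11) @ (B11 + B12)
    M7 = (A12 - A22) @ (B21 + B22)
    # recombine into the four output quadrants
    C11 = M1 + M4 - M5 + M7
    C12 = M3 + M5
    C21 = M2 + M4
    C22 = M1 - M2 + M3 + M6
    return np.block([[C11, C12], [C21, C22]])
```

The trade-off CENNA exploits is visible here: Strassen saves one multiplication per recursion level at the price of extra additions and more intermediate operands, so mixing it with naive multiplication lets a designer balance arithmetic cost against data movement.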
Multi-Mode Inference Engine for Convolutional Neural Networks
TLDR
A dataflow that enables performing both the fully-connected and convolutional computations for any filter/layer size using the same PEs is proposed, and a multi-mode inference engine (MMIE) based on this dataflow is introduced.
Towards energy-efficient convolutional neural network inference
TLDR
This thesis first evaluates the capabilities of off-the-shelf software-programmable hardware before diving into specialized hardware accelerators and exploring the potential of extremely quantized CNNs, and gives special consideration to external memory bandwidth.
Fast and Efficient Convolutional Accelerator for Edge Computing
TLDR
ZASCA achieves a performance efficiency of up to 94 percent over a set of state-of-the-art CNNs for image classification with dense representation where the performance efficiency is the ratio between the average runtime performance and the peak performance.
CNN Acceleration With Hardware-Efficient Dataflow for Super-Resolution
TLDR
This article proposes a hardware-efficient dataflow for CNN-based SR that reduces computation load by increasing data reuse, and increases processing element (PE) utilization by balancing the computation load among layers for high throughput.
Towards Fast and Energy-Efficient Binarized Neural Network Inference on FPGA
TLDR
Two types of fast and energy-efficient architectures for BNN inference are proposed and analysis and insights are provided to pick the better strategy of these two for different datasets and network models.
An Architecture to Accelerate Convolution in Deep Neural Networks
TLDR
This paper proposes an efficient computational method, inspired by the computational core of fully-connected neural networks, to process the convolutional layers of state-of-the-art deep CNNs within strict latency requirements; the method is implemented and customized for VGG and VGG-based networks, which have shown state-of-the-art performance on various classification/recognition data sets.
Accelerating CNN Inference on ASICs: A Survey
...

References (showing 1-10 of 46)
14.5 Eyeriss: An energy-efficient reconfigurable accelerator for deep convolutional neural networks
TLDR
To achieve state-of-the-art accuracy, CNNs need not only a larger number of layers but also millions of filter weights and varying shapes, which results in substantial data movement that consumes significant energy.
ShiDianNao: Shifting vision processing closer to the sensor
TLDR
This paper proposes an accelerator that is 60x more energy-efficient than the previous state-of-the-art neural network accelerator; designed down to the layout at 65 nm with a modest footprint and consuming only 320 mW, it is still about 30x faster than high-end GPUs.
Memory-centric accelerator design for Convolutional Neural Networks
TLDR
It is shown that the effects of the memory bottleneck can be reduced by a flexible memory hierarchy that supports the complex data access patterns of CNN workloads and ensures that on-chip memory size is minimized, reducing area and energy usage.
DianNao: a small-footprint high-throughput accelerator for ubiquitous machine-learning
TLDR
This study designs an accelerator for large-scale CNNs and DNNs, with a special emphasis on the impact of memory on accelerator design, performance and energy, and shows that it is possible to design an accelerator with a high throughput, capable of performing 452 GOP/s in a small footprint.
A dynamically configurable coprocessor for convolutional neural networks
TLDR
This is the first CNN architecture to achieve real-time video stream processing (25 to 30 frames per second) on a wide range of object detection and recognition tasks.
DaDianNao: A Machine-Learning Supercomputer
  • Yunji Chen, Tao Luo, O. Temam
  • 2014 47th Annual IEEE/ACM International Symposium on Microarchitecture, 2014
TLDR
This article introduces a custom multi-chip machine-learning architecture, showing that, on a subset of the largest known neural network layers, it is possible to achieve a speedup of 450.65x over a GPU, and reduce the energy by 150.31x on average for a 64-chip system.
A 240 G-ops/s Mobile Coprocessor for Deep Neural Networks
TLDR
The nn-X system is presented, a scalable, low-power coprocessor for enabling real-time execution of deep neural networks, able to achieve a peak performance of 227 G-ops/s, which translates to a performance per power improvement of 10 to 100 times that of conventional mobile and desktop processors.
Optimizing FPGA-based Accelerator Design for Deep Convolutional Neural Networks
TLDR
This work implements a CNN accelerator on a VC707 FPGA board and compares it to previous approaches, achieving a peak performance of 61.62 GFLOPS under a 100 MHz working frequency, which significantly outperforms prior designs.
Origami: A Convolutional Network Accelerator
TLDR
This paper presents the first convolutional network accelerator which is scalable to network sizes that are currently only handled by workstation GPUs, but remains within the power envelope of embedded systems.
4.6 A 1.93TOPS/W scalable deep learning/inference processor with tetra-parallel MIMD architecture for big-data applications
TLDR
A high-performance and energy-efficient DL/DI (deep inference) processor is required to realize user-centric pattern recognition in portable devices.
...