DianNao: a small-footprint high-throughput accelerator for ubiquitous machine-learning

@inproceedings{Chen2014DianNaoAS,
  title={DianNao: a small-footprint high-throughput accelerator for ubiquitous machine-learning},
  author={Tianshi Chen and Zidong Du and Ninghui Sun and Jia Wang and Chengyong Wu and Yunji Chen and Olivier Temam},
  booktitle={Proceedings of the 19th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS)},
  year={2014}
}
  • Published 24 February 2014
Machine-learning tasks are becoming pervasive in a broad range of domains, and in a broad range of systems (from embedded systems to data centers). […] Key Result: Such a high throughput in a small footprint can open up the usage of state-of-the-art machine-learning algorithms in a broad set of systems and for a broad set of applications.
DianNao family
TLDR
A series of hardware accelerators designed for ML (especially neural networks) is introduced, with a special emphasis on the impact of memory on accelerator design, performance, and energy.
Fast and Efficient Convolutional Accelerator for Edge Computing
TLDR
ZASCA achieves a performance efficiency of up to 94 percent over a set of state-of-the-art CNNs for image classification with dense representation, where performance efficiency is the ratio of average runtime performance to peak performance.
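The performance-efficiency metric used above is a simple ratio. A minimal sketch, with made-up numbers purely for illustration (the function name and the GOP/s figures are assumptions, not taken from the paper):

```python
def performance_efficiency(avg_runtime_perf_gops: float, peak_perf_gops: float) -> float:
    """Efficiency as a fraction of peak: average runtime performance / peak performance.

    A result of 0.94 corresponds to the 94 percent figure quoted for ZASCA.
    """
    return avg_runtime_perf_gops / peak_perf_gops

# Hypothetical example: an accelerator sustaining 94 GOP/s on average
# against a 100 GOP/s peak runs at 94% efficiency.
print(performance_efficiency(94.0, 100.0))  # 0.94
```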
SCALEDEEP: A scalable compute architecture for learning and evaluating deep networks
TLDR
SCALEDEEP is a dense, scalable server architecture, whose processing, memory and interconnect subsystems are specialized to leverage the compute and communication characteristics of DNNs, and primarily targets DNN training, as opposed to only inference or evaluation.
SparCE: Sparsity Aware General-Purpose Core Extensions to Accelerate Deep Neural Networks
TLDR
This work proposes Sparsity-aware Core Extensions (SparCE) - a set of low-overhead micro-architectural and ISA extensions that dynamically detect whether an operand is zero and subsequently skip a set of future instructions that use it, improving the performance of DNNs on general-purpose processor (GPP) cores.
Eyeriss v2: A Flexible Accelerator for Emerging Deep Neural Networks on Mobile Devices
TLDR
Eyeriss v2, a DNN accelerator architecture designed for running compact and sparse DNNs, is presented; it introduces a highly flexible on-chip network that can adapt to the different amounts of data reuse and bandwidth requirements of different data types, improving the utilization of the computation resources.
DaDianNao: A Machine-Learning Supercomputer
  • Yunji Chen, Tao Luo, O. Temam
  • Computer Science
    2014 47th Annual IEEE/ACM International Symposium on Microarchitecture
  • 2014
TLDR
This article introduces a custom multi-chip machine-learning architecture, showing that, on a subset of the largest known neural network layers, it is possible to achieve a speedup of 450.65x over a GPU, and reduce the energy by 150.31x on average for a 64-chip system.
DaDianNao: A Neural Network Supercomputer
TLDR
A custom multi-chip machine-learning architecture is introduced, combining custom storage and computational units with separate electrical and optical inter-chip interconnects, and it is shown that, on a subset of the largest known neural network layers, it is possible to achieve a speedup of 656.63× over a GPU and reduce energy by 184.05× on average for a 64-chip system.
SOLAR: Services-Oriented Deep Learning Architectures-Deep Learning as a Service
TLDR
Experimental results demonstrate that SOLAR provides a ubiquitous framework for diverse applications without increasing the burden on programmers, and that the GPU and FPGA hardware accelerators in SOLAR achieve significant speedups over conventional Intel i5 processors, with great scalability.
Understanding the Impact of On-chip Communication on DNN Accelerator Performance
TLDR
The communication flows within CNN inference accelerators for edge devices are studied to justify current and future decisions in the design of the on-chip networks that interconnect their processing elements, and the potential impact of introducing the novel paradigm of wireless on-chip networks is discussed.
...

References

SHOWING 1-10 OF 48 REFERENCES
Deep learning with COTS HPC systems
TLDR
This paper presents technical details and results from the authors' own system based on Commodity Off-The-Shelf High Performance Computing (COTS HPC) technology: a cluster of GPU servers with Infiniband interconnects and MPI, and shows that it can scale to networks with over 11 billion parameters using just 16 machines.
Improving the speed of neural networks on CPUs
TLDR
This paper uses speech recognition as an example task, and shows that a real-time hybrid hidden Markov model / neural network (HMM/NN) large vocabulary system can be built with a 10× speedup over an unoptimized baseline and a 4× speedup over an aggressively optimized floating-point baseline at no cost in accuracy.
BenchNN: On the broad potential application scope of hardware neural network accelerators
TLDR
Software neural network implementations of 5 RMS applications from the PARSEC Benchmark Suite are developed and evaluated, and it is highlighted that a hardware neural network accelerator is indeed compatible with many of the emerging high-performance workloads currently accepted as benchmarks for high-performance micro-architectures.
Leveraging the error resilience of machine-learning applications for designing highly energy efficient accelerators
TLDR
This paper proposes to expand the application scope, error tolerance, and energy savings of inexact computing systems through neural network architectures, and demonstrates that the proposed inexact neural network accelerator can achieve 43.91%-62.49% savings in energy consumption.
Bridging the computation gap between programmable processors and hardwired accelerators
TLDR
This paper proposes a customized semi-programmable loop accelerator architecture that exploits the efficiency gains available through high levels of customization, while maintaining sufficient flexibility to execute multiple similar loops.
A defect-tolerant accelerator for emerging high-performance applications
  • O. Temam
  • Computer Science
    2012 39th Annual International Symposium on Computer Architecture (ISCA)
  • 2012
TLDR
It is empirically shown that the conceptual error tolerance of neural networks does translate into the defect tolerance of hardware neural networks, paving the way for their introduction in heterogeneous multi-cores as intrinsically defect-tolerant and energy-efficient accelerators.
A dynamically configurable coprocessor for convolutional neural networks
TLDR
This is the first CNN architecture to achieve real-time video stream processing (25 to 30 frames per second) on a wide range of object detection and recognition tasks.
A digital neurosynaptic core using embedded crossbar memory with 45pJ per spike in 45nm
TLDR
This work fabricated a key building block of a modular neuromorphic architecture, a neurosynaptic core with 256 digital integrate-and-fire neurons and a 1024×256-bit SRAM crossbar memory for synapses, using IBM's 45nm SOI process, leading to ultra-low active power consumption.
Understanding sources of inefficiency in general-purpose chips
TLDR
The sources of these performance and energy overheads in general-purpose processing systems are explored by quantifying the overheads of a 720p HD H.264 encoder running on a general-purpose CMP system, and methods are explored to eliminate these overheads by transforming the CPU into a specialized system for H.264 encoding.
Towards Hardware Acceleration of Neuroevolution for Multimedia Processing Applications on Mobile Devices
This paper addresses the problem of accelerating large artificial neural networks (ANNs), whose topology and weights can evolve via the use of a genetic algorithm. The proposed digital hardware
...