DaDianNao: A Machine-Learning Supercomputer

@article{Chen2014DaDianNaoAM,
  title={DaDianNao: A Machine-Learning Supercomputer},
  author={Yunji Chen and Tao Luo and Shaoli Liu and Shijin Zhang and Liqiang He and Jia Wang and Ling Li and Tianshi Chen and Zhiwei Xu and Ninghui Sun and Olivier Temam},
  journal={2014 47th Annual IEEE/ACM International Symposium on Microarchitecture},
  year={2014},
  pages={609-622}
}
  • Yunji Chen, Tao Luo, O. Temam
  • Published 13 December 2014
  • Computer Science
  • 2014 47th Annual IEEE/ACM International Symposium on Microarchitecture
Many companies are deploying services, either for consumers or industry, which are largely based on machine-learning algorithms for sophisticated processing of large amounts of data. The state-of-the-art and most popular such machine-learning algorithms are Convolutional and Deep Neural Networks (CNNs and DNNs), which are known to be both computationally and memory intensive. A number of neural network accelerators have been recently proposed which can offer high computational capacity/area… 
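As a rough illustration of why these networks are computationally and memory intensive, the Python sketch below counts the multiply-accumulate operations and weight storage of a single convolutional layer. The layer dimensions and the 2-byte weight width are illustrative assumptions, not figures taken from the paper.

```python
# Back-of-the-envelope cost of one convolutional layer (illustrative sizes):
# each of the output_h x output_w output positions computes, per output
# channel, a dot product over in_channels x k x k inputs.

def conv_layer_cost(in_channels, out_channels, k, output_h, output_w,
                    bytes_per_weight=2):
    macs = out_channels * output_h * output_w * in_channels * k * k
    weights = out_channels * in_channels * k * k
    return macs, weights * bytes_per_weight

if __name__ == "__main__":
    macs, weight_bytes = conv_layer_cost(
        in_channels=256, out_channels=256, k=3, output_h=56, output_w=56
    )
    print(f"multiply-accumulates: {macs / 1e9:.2f} G")    # ~1.85 G MACs
    print(f"weight storage:       {weight_bytes / 1e6:.2f} MB")  # ~1.18 MB
```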

Citations

DaDianNao: A Neural Network Supercomputer
TLDR
A custom multi-chip machine-learning architecture combining custom storage and computational units is introduced, with electrical and optical inter-chip interconnects considered separately, and it is shown that, on a subset of the largest known neural network layers, a 64-chip system can achieve a speedup of 656.63× over a GPU and reduce energy by 184.05× on average.
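A central property of the multi-chip design summarized above is that synaptic weights stay resident in on-chip storage next to the compute units, so inter-chip traffic carries only neuron values. The NumPy sketch below illustrates that partitioning idea only; the Node class, node count, layer sizes, and ReLU activation are illustrative assumptions, not a model of the actual hardware.

```python
# Sketch of the multi-chip idea: the weight matrix of a layer is partitioned
# across nodes and stays resident on each node, so only the (much smaller)
# neuron activations are exchanged. NumPy stands in for the custom units.
import numpy as np

class Node:
    """One 'chip': holds its slice of the weights permanently on-node."""
    def __init__(self, weight_slice):
        self.weights = weight_slice          # resident, never re-fetched

    def forward(self, inputs):
        # Each node computes only the output neurons assigned to it.
        return np.maximum(self.weights @ inputs, 0.0)  # matmul + ReLU

def run_layer(nodes, inputs):
    # Send the input neurons to every node (inter-chip traffic is only
    # activations), then concatenate each node's partial output.
    return np.concatenate([n.forward(inputs) for n in nodes])

rng = np.random.default_rng(0)
n_nodes, n_in, n_out = 4, 1024, 4096
full_w = rng.standard_normal((n_out, n_in)).astype(np.float32)
nodes = [Node(w) for w in np.split(full_w, n_nodes, axis=0)]

x = rng.standard_normal(n_in).astype(np.float32)
y = run_layer(nodes, x)
assert np.allclose(y, np.maximum(full_w @ x, 0.0), atol=1e-3)
print(y.shape)  # (4096,) output neurons, computed without moving any weights
```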
Memory Requirements for Convolutional Neural Network Hardware Accelerators
TLDR
It is shown that bandwidth and memory requirements for different networks, and occasionally for different layers within a network, can each vary by multiple orders of magnitude, which makes designing fast and efficient hardware for all CNN applications difficult.
SCALEDEEP: A scalable compute architecture for learning and evaluating deep networks
TLDR
SCALEDEEP is a dense, scalable server architecture, whose processing, memory and interconnect subsystems are specialized to leverage the compute and communication characteristics of DNNs, and primarily targets DNN training, as opposed to only inference or evaluation.
Deep Fusion: A Software Scheduling Method for Memory Access Optimization
TLDR
A general software scheduling method is proposed to reduce the memory access cost of DNN algorithms, achieving an average speedup of 1.6× on the experimental platform; the best result, on ResNet-50, reaches up to 56% and 2.62×.
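The paper's scheduling method itself is not reproduced here; the toy sketch below only illustrates the general fusion idea of trading a full-size intermediate tensor for tile-sized ones so that memory traffic shrinks. The element-wise operations, tile size, and array shape are arbitrary assumptions.

```python
# Toy illustration of operator fusion for memory-access reduction: the unfused
# pipeline materializes a full intermediate array, while the fused version
# applies both element-wise ops tile by tile so the intermediate stays small.
import numpy as np

def unfused(x):
    tmp = np.maximum(x, 0.0)        # ReLU: full-size intermediate written out
    return tmp * 2.0 + 1.0          # scale/shift: intermediate read back in

def fused(x, tile=4096):
    out = np.empty_like(x)
    for start in range(0, x.size, tile):
        sl = slice(start, min(start + tile, x.size))
        t = np.maximum(x[sl], 0.0)  # only a tile-sized intermediate exists
        out[sl] = t * 2.0 + 1.0
    return out

x = np.random.default_rng(0).standard_normal(1_000_000).astype(np.float32)
assert np.allclose(unfused(x), fused(x))
```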
A Convolutional Neural Networks Accelerator Based on Parallel Memory
TLDR
A new CNN accelerator based on parallel memory technology is proposed, which supports multiple forms of parallelism; a super processing unit with a kernel buffer and an output buffer makes computation and data fetching more streamlined and thereby ensures the performance of the accelerator.
A Small-Footprint Accelerator for Large-Scale Neural Networks
TLDR
It is shown that it is possible to design a high-throughput accelerator capable of performing 452 GOP/s of key NN operations (such as synaptic weight multiplications and neuron output additions) in a small footprint, with a special emphasis on the impact of memory on accelerator design, performance, and energy.
High performance accelerators for deep neural networks: A review
TLDR
This work surveys the state of the art in DNN accelerators that have recently been developed as application-specific integrated circuits (ASICs) and field-programmable gate arrays (FPGAs).
Cambricon-X: An accelerator for sparse neural networks
TLDR
A novel accelerator, Cambricon-X, is proposed to exploit the sparsity and irregularity of NN models for increased efficiency; experimental results show that it achieves, on average, a 7.23× speedup and 6.43× energy saving over the state-of-the-art NN accelerator.
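As a minimal sketch of what exploiting weight sparsity can mean, the code below stores only the non-zero weights and their positions per output neuron and computes with those alone. This is a plain software illustration with made-up sizes and sparsity, not Cambricon-X's actual indexing hardware.

```python
# Compressed representation of sparse weights: per output neuron, keep only
# (indices, values) of non-zero synapses, so each dot product touches just
# the inputs with non-zero weights.
import numpy as np

def compress(weights, threshold=0.0):
    rows = []
    for row in weights:
        idx = np.nonzero(np.abs(row) > threshold)[0]
        rows.append((idx, row[idx]))
    return rows

def sparse_forward(compressed_rows, x):
    return np.array([vals @ x[idx] for idx, vals in compressed_rows])

rng = np.random.default_rng(0)
w = rng.standard_normal((64, 256)) * (rng.random((64, 256)) < 0.1)  # ~90% zeros
x = rng.standard_normal(256)
assert np.allclose(sparse_forward(compress(w), x), w @ x)
```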
CNN-MERP: An FPGA-based memory-efficient reconfigurable processor for forward and backward propagation of convolutional neural networks
TLDR
CNN-MERP incorporates an efficient memory hierarchy that significantly reduces bandwidth requirements through multiple optimizations, including on/off-chip data allocation, data-flow optimization, and data reuse, and is utilized to enable fast and efficient reconfiguration of CNNs.
Learning on Hardware: A Tutorial on Neural Network Accelerators and Co-Processors
TLDR
An overview of existing neural network hardware accelerators and acceleration methods is given, along with a recommendation of suitable applications, focusing on accelerating the inference of convolutional neural networks used for image recognition tasks.
...

References

Showing 1-10 of 56 references
DianNao: a small-footprint high-throughput accelerator for ubiquitous machine-learning
TLDR
This study designs an accelerator for large-scale CNNs and DNNs, with a special emphasis on the impact of memory on accelerator design, performance and energy, and shows that it is possible to design an accelerator with a high throughput, capable of performing 452 GOP/s in a small footprint.
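The operations behind a figure like 452 GOP/s are essentially per-synapse multiplications and per-neuron additions followed by an activation. The sketch below writes them out explicitly for a hypothetical fully connected layer and counts them; the layer sizes and ReLU activation are illustrative assumptions.

```python
# The "key NN operations" are per-synapse multiplications and per-neuron
# accumulations; written out explicitly, each output neuron of a
# fully connected layer costs n_in multiplies and n_in adds.
import numpy as np

def layer_forward_counted(weights, inputs):
    n_out, n_in = weights.shape
    outputs = np.zeros(n_out)
    mults = adds = 0
    for o in range(n_out):
        acc = 0.0
        for i in range(n_in):
            acc += weights[o, i] * inputs[i]   # one multiply + one add
            mults += 1
            adds += 1
        outputs[o] = max(acc, 0.0)             # activation (ReLU here)
    return outputs, mults + adds

rng = np.random.default_rng(0)
w, x = rng.standard_normal((100, 1000)), rng.standard_normal(1000)
y, ops = layer_forward_counted(w, x)
print(ops)  # 200000 operations for this 1000-input / 100-output layer
# At a sustained 452 GOP/s, that corresponds to roughly 2.26 million such
# layers per second (452e9 / 2e5), which is the sense of "high throughput".
```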
Improving the speed of neural networks on CPUs
TLDR
This paper uses speech recognition as an example task, and shows that a real-time hybrid hidden Markov model / neural network (HMM/NN) large-vocabulary system can be built with a 10× speedup over an unoptimized baseline and a 4× speedup over an aggressively optimized floating-point baseline, at no cost in accuracy.
A Massively Parallel, Energy Efficient Programmable Accelerator for Learning and Classification
TLDR
The MAPLE architecture is described, its design space is explored with a simulator, the automatic mapping of application kernels onto the hardware is illustrated, and its performance and energy benefits over classic server-based implementations are presented.
Deep learning with COTS HPC systems
TLDR
This paper presents technical details and results from the authors' own system based on Commodity Off-The-Shelf High Performance Computing (COTS HPC) technology, a cluster of GPU servers with Infiniband interconnects and MPI, and shows that it can scale to networks with over 11 billion parameters using just 16 machines.
A digital neurosynaptic core using embedded crossbar memory with 45pJ per spike in 45nm
TLDR
This work fabricated a key building block of a modular neuromorphic architecture, a neurosynaptic core, with 256 digital integrate-and-fire neurons and a 1024×256 bit SRAM crossbar memory for synapses using IBM's 45nm SOI process, leading to ultra-low active power consumption.
A defect-tolerant accelerator for emerging high-performance applications
  • O. Temam
  • Computer Science
  • 2012 39th Annual International Symposium on Computer Architecture (ISCA)
  • 2012
TLDR
It is empirically shown that the conceptual error tolerance of neural networks does translate into defect tolerance of hardware neural networks, paving the way for their introduction in heterogeneous multi-cores as intrinsically defect-tolerant and energy-efficient accelerators.
Understanding sources of inefficiency in general-purpose chips
TLDR
The sources of performance and energy overheads in general-purpose processing systems are explored by quantifying the overheads of a 720p HD H.264 encoder running on a general-purpose CMP system, and by exploring methods to eliminate these overheads by transforming the CPU into a specialized system for H.264 encoding.
SpiNNaker: Mapping neural networks onto a massively-parallel chip multiprocessor
TLDR
The methods by which neural networks are mapped onto the system, and how features designed into the chip are to be exploited in practice, are described to ensure that, when the chip is delivered, it will work as anticipated.
Large Scale Distributed Deep Networks
TLDR
This paper considers the problem of training a deep network with billions of parameters using tens of thousands of CPU cores and develops two algorithms for large-scale distributed training, Downpour SGD and Sandblaster L-BFGS, which increase the scale and speed of deep network training.
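Downpour SGD is, at its core, asynchronous data-parallel SGD around a shared parameter server: workers pull a possibly stale copy of the parameters, compute a gradient on their own data shard, and push it back without global synchronization. The single-process sketch below illustrates only that pattern, on a made-up least-squares problem; the real system shards both data and parameters across many machines.

```python
# Minimal, single-process sketch of the parameter-server pattern behind
# asynchronous SGD (illustrative only: no sharding, no real concurrency).
import numpy as np

class ParameterServer:
    def __init__(self, dim):
        self.params = np.zeros(dim)

    def pull(self):
        return self.params.copy()            # workers get a (stale) snapshot

    def push(self, grad, lr=0.1):
        self.params -= lr * grad              # apply whichever gradient arrives

def worker_step(server, x_batch, y_batch):
    w = server.pull()
    pred = x_batch @ w
    grad = x_batch.T @ (pred - y_batch) / len(y_batch)   # least-squares grad
    server.push(grad)

rng = np.random.default_rng(0)
true_w = rng.standard_normal(5)
server = ParameterServer(dim=5)
for _ in range(500):                          # "workers" interleaved round-robin
    x = rng.standard_normal((32, 5))
    worker_step(server, x, x @ true_w)
print(np.round(server.params - true_w, 3))    # close to zero after training
```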
ImageNet classification with deep convolutional neural networks
TLDR
A large, deep convolutional neural network was trained to classify the 1.2 million high-resolution images in the ImageNet LSVRC-2010 contest into the 1000 different classes and employed a recently developed regularization method called "dropout" that proved to be very effective.
...