EIE: Efficient Inference Engine on Compressed Deep Neural Network
@article{Han2016EIEEI, title={EIE: Efficient Inference Engine on Compressed Deep Neural Network}, author={Song Han and Xingyu Liu and Huizi Mao and Jing Pu and Ardavan Pedram and Mark Horowitz and William J. Dally}, journal={2016 ACM/IEEE 43rd Annual International Symposium on Computer Architecture (ISCA)}, year={2016}, pages={243-254} }
State-of-the-art deep neural networks (DNNs) have hundreds of millions of connections and are both computationally and memory intensive, making them difficult to deploy on embedded systems with limited hardware resources and power budgets. While custom hardware helps the computation, fetching weights from DRAM is two orders of magnitude more expensive than ALU operations, and dominates the required power. Previously proposed 'Deep Compression' makes it possible to fit large DNNs (AlexNet and…
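For concreteness, the core operation EIE accelerates is a sparse matrix-vector product over a compressed (pruned, weight-shared) layer. The NumPy sketch below is an illustrative software model only, with hypothetical names; the paper implements this in dedicated hardware with a CSC-style encoding, skipping of zero activations, and a small shared weight codebook.

```python
import numpy as np

def sparse_shared_matvec(n_rows, col_ptr, row_idx, weight_idx, codebook, x):
    """Compute y = W @ x where W is stored column-wise (CSC-like):
    the nonzeros of column j sit in positions col_ptr[j]:col_ptr[j+1],
    row_idx gives their row, and weight_idx is a small index into the
    shared codebook of real-valued weights (weight sharing)."""
    y = np.zeros(n_rows)
    for j, xj in enumerate(x):
        if xj == 0.0:            # skip zero activations (activation sparsity)
            continue
        for k in range(col_ptr[j], col_ptr[j + 1]):
            y[row_idx[k]] += codebook[weight_idx[k]] * xj
    return y

# Tiny example: a 3x4 matrix with 4 nonzeros drawn from a 2-entry codebook.
codebook   = np.array([0.5, -1.0])
col_ptr    = [0, 1, 2, 3, 4]
row_idx    = [0, 2, 1, 0]
weight_idx = [0, 1, 0, 1]
x = np.array([1.0, 0.0, 2.0, 3.0])
print(sparse_shared_matvec(3, col_ptr, row_idx, weight_idx, codebook, x))  # [-2.5, 1.0, 0.0]
```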
1,823 Citations
SparCE: Sparsity Aware General-Purpose Core Extensions to Accelerate Deep Neural Networks
- Computer ScienceIEEE Transactions on Computers
- 2019
This work proposes Sparsity-aware Core Extensions (SparCE), a set of low-overhead micro-architectural and ISA extensions that dynamically detect whether an operand is zero and subsequently skip a set of future instructions that use it, improving the performance of DNNs on general-purpose processor (GPP) cores. The zero-skipping idea is sketched in software form below.
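As a software-level illustration only (SparCE itself adds ISA and micro-architectural support rather than explicit branches), the zero-skipping idea looks like this:

```python
def sparse_dot(activations, weights):
    """Software illustration of the zero-skipping idea behind SparCE:
    when an operand is detected to be zero, the multiply-accumulate work
    that depends on it is skipped entirely (in SparCE this is done by
    hardware/ISA support, not an explicit branch)."""
    acc = 0.0
    for a, w in zip(activations, weights):
        if a == 0.0:      # dynamic zero detection
            continue      # skip the dependent instructions
        acc += a * w
    return acc
```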
High-Efficiency Convolutional Ternary Neural Networks with Custom Adder Trees and Weight Compression
- Computer ScienceACM Trans. Reconfigurable Technol. Syst.
- 2018
This work details an architecture dedicated to inference using ternary weights and activations, which achieves up to 5.2k frames per second per watt for classification on a VC709 board while using approximately half of the resources of the FPGA (see the sketch after this entry).
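A minimal sketch of why ternary weights map well to adder trees: with weights restricted to {-1, 0, +1}, each multiply degenerates to an add, a subtract, or nothing. The function below is illustrative only, not the cited hardware design.

```python
def ternary_dot(activations, ternary_weights):
    """Illustrative ternary dot product: with weights in {-1, 0, +1},
    each 'multiply' is just an add, a subtract, or a skip, which is what
    makes compact adder-tree hardware attractive."""
    acc = 0
    for a, w in zip(activations, ternary_weights):
        if w == 1:
            acc += a
        elif w == -1:
            acc -= a
        # w == 0 contributes nothing and can be skipped/compressed away
    return acc
```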
Throughput Optimizations for FPGA-based Deep Neural Network Inference
- Computer ScienceMicroprocess. Microsystems
- 2018
XOMA: exclusive on-chip memory architecture for energy-efficient deep learning acceleration
- Computer ScienceASP-DAC
- 2019
This paper proposes an on-chip DNN co-processor architecture in which minimizing memory accesses is the primary design objective; off-chip memory accesses are eliminated to the maximum possible extent, providing the lowest possible energy consumption for inference.
DUET: Boosting Deep Neural Network Efficiency on Dual-Module Architecture
- Computer Science2020 53rd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO)
- 2020
This work proposes dual-module processing that uses approximate modules learned from the original DNN layers to compute insensitive activations, saving the expensive computations and data accesses needed only for sensitive activations, and proposes an algorithm-architecture co-design to boost DNN execution efficiency.
Data-Driven Neuromorphic DRAM-based CNN and RNN Accelerators
- Computer Science2019 53rd Asilomar Conference on Signals, Systems, and Computers
- 2019
This paper reports developments over the last five years in convolutional and recurrent deep neural network hardware accelerators that exploit either spatial or temporal sparsity, similar to SNNs, yet achieve state-of-the-art throughput, power efficiency, and latency even when DRAM is used for the required storage of the weights and states of large DNNs.
EDEN: Enabling Energy-Efficient, High-Performance Deep Neural Network Inference Using Approximate DRAM
- Computer ScienceMICRO
- 2019
EDEN is the first general framework that reduces DNN energy consumption and DNN evaluation latency by using approximate DRAM devices, while strictly meeting a user-specified target DNN accuracy, and reliably improves the error resiliency of the DNN by an order of magnitude.
Accelerated Inference Framework of Sparse Neural Network Based on Nested Bitmask Structure
- Computer ScienceIJCAI
- 2019
This paper proposes a novel encoding approach for sparse neural networks after pruning and contends that the resulting Nested Bitmask Neural Network (NBNN) is an efficient neural network structure with only minor accuracy loss on SoC systems.
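As an illustration of bitmask encoding for pruned weights (single-level only; the nested structure of NBNN is the paper's contribution and is not modeled here):

```python
import numpy as np

def bitmask_encode(weights):
    """Single-level bitmask encoding of a pruned weight vector: a 0/1 mask
    marks which positions are nonzero, and only the nonzero values are stored.
    (NBNN nests such masks hierarchically; this sketch shows one level.)"""
    mask = (weights != 0).astype(np.uint8)
    values = weights[weights != 0]
    return mask, values

def bitmask_decode(mask, values):
    out = np.zeros(mask.shape, dtype=values.dtype)
    out[mask.astype(bool)] = values
    return out

w = np.array([0.0, 0.7, 0.0, 0.0, -0.3, 0.0])
mask, vals = bitmask_encode(w)
assert np.array_equal(bitmask_decode(mask, vals), w)
```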
Binary Deep Neural Networks
- Computer Science
- 2018
Espresso provides special convolutional and dense layers for BCNNs, leveraging bit-packing and bitwise computations for efficient execution; it speeds up matrix-multiplication routines while reducing memory usage when storing parameters and activations.
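The bit-packing idea can be sketched as an XNOR-popcount dot product over {-1, +1} values packed into machine words; this illustrates the general binary-network technique, not Espresso's actual kernels.

```python
def packed_binary_dot(a_bits, w_bits, n):
    """Bit-packed binary dot product: activations and weights in {-1, +1} are
    packed as bits (1 -> +1, 0 -> -1), then the dot product is recovered from
    an XNOR followed by a popcount."""
    xnor = ~(a_bits ^ w_bits) & ((1 << n) - 1)   # 1 where the signs agree
    matches = bin(xnor).count("1")
    return 2 * matches - n                        # +1 per match, -1 per mismatch

# Example: a = [+1, -1, +1, +1], w = [+1, +1, -1, +1]  ->  dot = 0
a_bits = 0b1101   # bit i encodes element i (LSB first)
w_bits = 0b1011
print(packed_binary_dot(a_bits, w_bits, 4))   # -> 0
```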
Bit Prudent In-Cache Acceleration of Deep Convolutional Neural Networks
- Computer Science2019 IEEE International Symposium on High Performance Computer Architecture (HPCA)
- 2019
This work proposes an in-SRAM architecture for accelerating Convolutional Neural Network inference by leveraging network redundancy and massive parallelism, together with an architecture for reduced-bit-width network models based on bit-serial computation.
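A small sketch of bit-serial evaluation, where reducing the weight bit width directly shortens the computation (the paper does this inside SRAM arrays; this software model is illustrative only and uses unsigned weights for simplicity):

```python
def bit_serial_dot(activations, weights, w_bits):
    """Bit-serial evaluation of sum(a*w): weights are consumed one bit-plane
    at a time, so fewer weight bits means fewer steps."""
    acc = 0
    for b in range(w_bits):                       # one weight bit-plane per step
        plane = sum(a for a, w in zip(activations, weights) if (w >> b) & 1)
        acc += plane << b                         # shift-and-add
    return acc

assert bit_serial_dot([3, 1, 2], [5, 0, 6], w_bits=3) == 3*5 + 1*0 + 2*6
```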
References
Minerva: Enabling Low-Power, Highly-Accurate Deep Neural Network Accelerators
- Computer ScienceISCA
- 2016
The continued success of Deep Neural Networks (DNNs) in classification tasks has sparked a trend of accelerating their execution with specialized hardware. While published designs easily give an…
Deep Compression: Compressing Deep Neural Network with Pruning, Trained Quantization and Huffman Coding
- Computer ScienceICLR
- 2016
This work introduces "deep compression", a three stage pipeline: pruning, trained quantization and Huffman coding, that work together to reduce the storage requirement of neural networks by 35x to 49x without affecting their accuracy.
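A toy sketch of the first two stages (magnitude pruning and k-means weight sharing) on a single weight matrix; the paper additionally retrains the shared centroids and Huffman-codes the resulting indices, which is omitted here. Function and parameter names are illustrative.

```python
import numpy as np

def prune_and_share(weights, sparsity=0.9, n_clusters=16, seed=0):
    """Toy sketch of the first two deep-compression stages on one weight matrix:
    (1) magnitude pruning to a target sparsity, (2) k-means clustering of the
    surviving weights so they share a small codebook (a plain Lloyd loop here)."""
    # Stage 1: magnitude pruning
    threshold = np.quantile(np.abs(weights), sparsity)
    mask = np.abs(weights) > threshold
    survivors = weights[mask]

    # Stage 2: k-means (Lloyd's algorithm) over the surviving weights
    rng = np.random.default_rng(seed)
    centers = rng.choice(survivors, size=min(n_clusters, survivors.size), replace=False)
    for _ in range(20):
        assign = np.argmin(np.abs(survivors[:, None] - centers[None, :]), axis=1)
        for c in range(len(centers)):
            if np.any(assign == c):
                centers[c] = survivors[assign == c].mean()
    return mask, assign, centers   # store: sparse mask + small indices + codebook

w = np.random.default_rng(1).normal(size=(64, 64))
mask, idx, codebook = prune_and_share(w.ravel())
```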
DaDianNao: A Machine-Learning Supercomputer
- Computer Science2014 47th Annual IEEE/ACM International Symposium on Microarchitecture
- 2014
This article introduces a custom multi-chip machine-learning architecture, showing that, on a subset of the largest known neural network layers, it is possible to achieve a speedup of 450.65x over a GPU, and reduce the energy by 150.31x on average for a 64-chip system.
ISAAC: A Convolutional Neural Network Accelerator with In-Situ Analog Arithmetic in Crossbars
- Computer Science2016 ACM/IEEE 43rd Annual International Symposium on Computer Architecture (ISCA)
- 2016
This work explores an in-situ processing approach, where memristor crossbar arrays not only store input weights, but are also used to perform dot-product operations in an analog manner.
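An idealized software model of the analog dot product a crossbar performs (real designs such as ISAAC also deal with DACs/ADCs, bit slicing, and noise, none of which is modeled here):

```python
import numpy as np

def crossbar_dot(voltages, conductances):
    """Idealized crossbar model: applying input voltages V_i to the rows of a
    crossbar whose cells have conductances G_ij yields, by Kirchhoff's current
    law, column currents I_j = sum_i V_i * G_ij."""
    return voltages @ conductances   # I = V @ G

V = np.array([0.2, 0.5, 0.1])
G = np.array([[1.0, 0.5],
              [0.2, 0.3],
              [0.4, 0.8]])
print(crossbar_dot(V, G))   # two column currents
```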
Learning both Weights and Connections for Efficient Neural Network
- Computer ScienceNIPS
- 2015
This work presents a method to reduce the storage and computation required by neural networks by an order of magnitude, without affecting their accuracy, by learning only the important connections and pruning redundant connections using a three-step method.
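The learn-prune-retrain loop can be illustrated on a toy linear model (a hypothetical example for clarity, not the paper's CNN setup):

```python
import numpy as np

def train(w, X, y, mask, lr=0.1, steps=200):
    """Gradient descent on a linear model; 'mask' keeps pruned weights at zero."""
    for _ in range(steps):
        grad = X.T @ (X @ w - y) / len(y)
        w -= lr * grad
        w *= mask                     # pruned connections stay removed
    return w

rng = np.random.default_rng(0)
X = rng.normal(size=(256, 8))
true_w = np.array([3.0, 0, 0, -2.0, 0, 0, 0, 1.0])   # only 3 important connections
y = X @ true_w

# Step 1: train densely; Step 2: prune small-magnitude weights; Step 3: retrain
w = train(np.zeros(8), X, y, mask=np.ones(8))
mask = (np.abs(w) > 0.5).astype(float)               # keep only important connections
w = train(w * mask, X, y, mask=mask)
print(mask.sum(), "of 8 connections kept")
```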
ShiDianNao: Shifting vision processing closer to the sensor
- Computer Science2015 ACM/IEEE 42nd Annual International Symposium on Computer Architecture (ISCA)
- 2015
This paper proposes an accelerator which is 60x more energy efficient than the previous state-of-the-art neural network accelerator, designed down to the layout at 65 nm, with a modest footprint and consuming only 320 mW, but still about 30x faster than high-end GPUs.
DianNao: a small-footprint high-throughput accelerator for ubiquitous machine-learning
- Computer ScienceASPLOS 2014
- 2014
This study designs an accelerator for large-scale CNNs and DNNs, with a special emphasis on the impact of memory on accelerator design, performance and energy, and shows that it is possible to design an accelerator with a high throughput, capable of performing 452 GOP/s in a small footprint.
Convolutional networks for fast, energy-efficient neuromorphic computing
- Computer ScienceProceedings of the National Academy of Sciences
- 2016
This approach allows the algorithmic power of deep learning to be merged with the efficiency of neuromorphic processors, bringing the promise of embedded, intelligent, brain-inspired computing one step closer.
Going Deeper with Embedded FPGA Platform for Convolutional Neural Network
- Computer ScienceFPGA
- 2016
This paper presents an in-depth analysis of state-of-the-art CNN models, showing that convolutional layers are computation-centric while fully-connected layers are memory-centric, and proposes a CNN accelerator design on an embedded FPGA for ImageNet large-scale image classification.
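The compute-centric vs. memory-centric distinction follows from arithmetic intensity: a convolutional weight is reused across every output position, while a fully-connected weight is used once. A quick back-of-the-envelope sketch (layer shapes are illustrative, not taken from the paper):

```python
def conv_stats(c_in, c_out, k, h_out, w_out):
    """MACs and weight count for one convolutional layer."""
    macs = c_in * c_out * k * k * h_out * w_out
    weights = c_in * c_out * k * k
    return macs, weights

def fc_stats(n_in, n_out):
    """MACs and weight count for one fully-connected layer."""
    return n_in * n_out, n_in * n_out

conv_macs, conv_w = conv_stats(c_in=256, c_out=256, k=3, h_out=28, w_out=28)
fc_macs, fc_w = fc_stats(n_in=25088, n_out=4096)
print("conv MACs per weight:", conv_macs / conv_w)   # ~784: heavy reuse -> compute-bound
print("fc   MACs per weight:", fc_macs / fc_w)       # 1: no reuse -> memory-bound
```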
Deep learning with COTS HPC systems
- Computer ScienceICML
- 2013
This paper presents technical details and results from their own system based on Commodity Off-The-Shelf High Performance Computing (COTS HPC) technology: a cluster of GPU servers with Infiniband interconnects and MPI, and shows that it can scale to networks with over 11 billion parameters using just 16 machines.