EIE: Efficient Inference Engine on Compressed Deep Neural Network

  • Song Han, Xingyu Liu, Huizi Mao, Jing Pu, Ardavan Pedram, Mark Horowitz, William J. Dally
  • Published 4 February 2016
  • Computer Science
  • 2016 ACM/IEEE 43rd Annual International Symposium on Computer Architecture (ISCA)
State-of-the-art deep neural networks (DNNs) have hundreds of millions of connections and are both computationally and memory intensive, making them difficult to deploy on embedded systems with limited hardware resources and power budgets. While custom hardware helps the computation, fetching weights from DRAM is two orders of magnitude more expensive than ALU operations, and dominates the required power. Previously proposed 'Deep Compression' makes it possible to fit large DNNs (AlexNet and… 
SparCE: Sparsity Aware General-Purpose Core Extensions to Accelerate Deep Neural Networks
This work proposes Sparsity-aware Core Extensions (SparCE), a set of low-overhead micro-architectural and ISA extensions that dynamically detect whether an operand is zero and then skip a set of future instructions that use it, improving the performance of DNNs on general-purpose processor (GPP) cores.
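The zero-skipping idea can be illustrated in software. The sketch below is not SparCE's micro-architecture or ISA; it is a hypothetical Python analogue (the name `zero_skip_dot` and the skip counter are assumptions made for illustration) that tests an operand for zero and skips the multiply-accumulate that would have used it.

```python
def zero_skip_dot(activations, weights):
    """Dot product that skips the multiply-accumulate for zero activations.

    Returns (result, skipped), where skipped counts the avoided operations.
    Hypothetical software analogue only; real hardware skips instructions.
    """
    if len(activations) != len(weights):
        raise ValueError("operand lengths differ")
    total = 0.0
    skipped = 0
    for a, w in zip(activations, weights):
        if a == 0:
            # A zero operand cannot change the sum, so the work is skipped.
            skipped += 1
            continue
        total += a * w
    return total, skipped
```

The skip count gives a rough sense of how much work sparsity removes: the sparser the activations, the larger the fraction of multiply-accumulates avoided.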
High-Efficiency Convolutional Ternary Neural Networks with Custom Adder Trees and Weight Compression
This work details an architecture dedicated to inference using ternary weights and activations, which achieves up to 5.2k frames per second per Watt for classification on a VC709 board using approximately half of the resources of the FPGA.
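One reason ternary weights suit adder-based hardware is that a dot product with weights in {-1, 0, +1} needs no multiplications. The following is a minimal sketch of that arithmetic only, not of the paper's adder trees or weight compression; the function name `ternary_dot` is an assumption for illustration.

```python
def ternary_dot(weights, activations):
    """Dot product with weights restricted to {-1, 0, +1}.

    Each term is an add, a subtract, or nothing, so no multiplier is needed.
    """
    if len(weights) != len(activations):
        raise ValueError("operand lengths differ")
    acc = 0
    for w, a in zip(weights, activations):
        if w == 1:
            acc += a
        elif w == -1:
            acc -= a
        elif w != 0:
            raise ValueError("weights must be -1, 0, or +1")
    return acc
```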
Throughput Optimizations for FPGA-based Deep Neural Network Inference
XOMA: exclusive on-chip memory architecture for energy-efficient deep learning acceleration
This paper proposes an on-chip DNN co-processor architecture where minimizing memory accesses is the primary design objective, and to the maximum possible extent, off-chip memory accesses are eliminated, providing the lowest possible energy consumption for inference.
DUET: Boosting Deep Neural Network Efficiency on Dual-Module Architecture
  • Liu Liu, Zheng Qu, Yuan Xie
  • Computer Science
  • 2020 53rd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO)
  • 2020
This work proposes dual-module processing, which uses approximate modules learned from the original DNN layers to compute insensitive activations, saving the expensive computations and data accesses of unnecessary sensitive activations, and presents an algorithm-architecture co-design to boost DNN execution efficiency.
Data-Driven Neuromorphic DRAM-based CNN and RNN Accelerators
This paper reports developments over the last 5 years in convolutional and recurrent deep neural network hardware accelerators that exploit either spatial or temporal sparsity, similar to SNNs, but achieve state-of-the-art throughput, power efficiency, and latency even when DRAM is used to store the weights and states of large DNNs.
EDEN: Enabling Energy-Efficient, High-Performance Deep Neural Network Inference Using Approximate DRAM
EDEN is the first general framework that reduces DNN energy consumption and DNN evaluation latency by using approximate DRAM devices, while strictly meeting a user-specified target DNN accuracy, and reliably improves the error resiliency of the DNN by an order of magnitude.
Accelerated Inference Framework of Sparse Neural Network Based on Nested Bitmask Structure
This paper contends that the Nested Bitmask Neural Network (NBNN) is an efficient neural network structure with only minor accuracy loss on the SoC system and proposes a novel encoding approach for a sparse neural network after pruning.
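To make the notion of bitmask encoding concrete, here is a minimal sketch of a flat, single-level bitmask format for a pruned weight vector. It is not the paper's nested scheme, whose details are not given here; the function names and the `(mask, values, length)` layout are assumptions for illustration.

```python
def bitmask_encode(dense):
    """Encode a sparse vector as (mask, values, length).

    Bit i of mask is 1 when dense[i] is nonzero; values holds only the
    nonzero entries, in order.
    """
    mask = 0
    values = []
    for i, v in enumerate(dense):
        if v != 0:
            mask |= 1 << i
            values.append(v)
    return mask, values, len(dense)


def bitmask_decode(mask, values, length):
    """Rebuild the dense vector from its bitmask encoding."""
    remaining = iter(values)
    return [next(remaining) if (mask >> i) & 1 else 0 for i in range(length)]
```

Storage drops from one value per position to one bit per position plus one value per nonzero, which pays off once most weights have been pruned.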
Espresso provides special convolutional and dense layers for BCNNs, leveraging bit-packing and bitwise computations for efficient execution, and speeds up matrix-multiplication routines while reducing memory usage when storing parameters and activations.
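Bit-packing and bitwise computation for binary networks are commonly realized as an XNOR followed by a population count. The sketch below shows that general technique in plain Python, assuming values in {-1, +1}; it is not Espresso's API, and the names `pack_bits` and `binary_dot` are hypothetical.

```python
def pack_bits(signs):
    """Pack a sequence of +1/-1 values into one integer.

    Bit i of the result is 1 when signs[i] == +1 and 0 when it is -1.
    """
    word = 0
    for i, s in enumerate(signs):
        if s not in (1, -1):
            raise ValueError("values must be +1 or -1")
        if s == 1:
            word |= 1 << i
    return word


def binary_dot(packed_a, packed_b, n):
    """Dot product of two packed +1/-1 vectors of length n.

    XNOR marks the positions where the signs agree; each agreement
    contributes +1 and each disagreement -1, so the dot product is
    2 * agreements - n.
    """
    lane_mask = (1 << n) - 1
    agree = ~(packed_a ^ packed_b) & lane_mask
    return 2 * bin(agree).count("1") - n
```

Packing stores one value per bit instead of one per word, which is where the memory saving comes from; the dot product then needs only one XOR, one NOT, and one popcount per packed word.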
Bit Prudent In-Cache Acceleration of Deep Convolutional Neural Networks
This work proposes an in-SRAM architecture for accelerating Convolutional Neural Network inference by leveraging network redundancy and massive parallelism, and proposes an architecture for network models with a reduced bit width by leveraging bit-serial computation.
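Bit-serial computation processes one bit position of an operand per step, so a narrower bit width means fewer steps. The sketch below illustrates the arithmetic only, assuming unsigned integer weights; it does not model the in-SRAM design, and the name `bit_serial_dot` is an assumption.

```python
def bit_serial_dot(weights, activations, bits):
    """Dot product computed one weight bit-plane at a time.

    weights must be unsigned integers that fit in `bits` bits and
    activations must be integers. Each step sums the activations whose
    weight has the current bit set, then shifts that partial sum into place.
    """
    if len(weights) != len(activations):
        raise ValueError("operand lengths differ")
    if any(w < 0 or w >> bits for w in weights):
        raise ValueError("weights must be unsigned and fit in `bits` bits")
    acc = 0
    for b in range(bits):
        plane_sum = sum(a for w, a in zip(weights, activations) if (w >> b) & 1)
        acc += plane_sum << b
    return acc
```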


Minerva: Enabling Low-Power, Highly-Accurate Deep Neural Network Accelerators
The continued success of Deep Neural Networks (DNNs) in classification tasks has sparked a trend of accelerating their execution with specialized hardware. While published designs easily give an
Deep Compression: Compressing Deep Neural Network with Pruning, Trained Quantization and Huffman Coding
This work introduces "deep compression", a three-stage pipeline of pruning, trained quantization, and Huffman coding, whose stages work together to reduce the storage requirement of neural networks by 35x to 49x without affecting their accuracy.
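The three stages can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the authors' implementation: the codebook is fixed rather than trained, positions of the surviving weights are assumed to be stored separately, and all function names are hypothetical.

```python
import heapq
from collections import Counter


def prune(weights, threshold):
    """Stage 1: zero out weights whose magnitude is below the threshold."""
    return [w if abs(w) >= threshold else 0.0 for w in weights]


def quantize(weights, codebook):
    """Stage 2 (simplified): replace each nonzero weight by the index of
    its nearest codebook entry. The codebook is given, not trained here."""
    return [min(range(len(codebook)), key=lambda k: abs(codebook[k] - w))
            for w in weights if w != 0.0]


def huffman_code_lengths(symbols):
    """Stage 3: Huffman code length in bits for each distinct symbol."""
    freq = Counter(symbols)
    if len(freq) == 1:
        return {next(iter(freq)): 1}
    # Heap entries: (count, unique tie-breaker, symbols under this subtree).
    heap = [(count, i, [sym]) for i, (sym, count) in enumerate(freq.items())]
    heapq.heapify(heap)
    lengths = dict.fromkeys(freq, 0)
    tie = len(heap)
    while len(heap) > 1:
        c1, _, s1 = heapq.heappop(heap)
        c2, _, s2 = heapq.heappop(heap)
        for sym in s1 + s2:
            lengths[sym] += 1  # merged symbols sit one level deeper
        heapq.heappush(heap, (c1 + c2, tie, s1 + s2))
        tie += 1
    return lengths
```

Frequent codebook indices receive shorter codes, so the encoded size is the sum of `count * length` over the symbols rather than a fixed number of bits per weight.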
DaDianNao: A Machine-Learning Supercomputer
  • Yunji Chen, Tao Luo, O. Temam
  • Computer Science
  • 2014 47th Annual IEEE/ACM International Symposium on Microarchitecture
  • 2014
This article introduces a custom multi-chip machine-learning architecture, showing that, on a subset of the largest known neural network layers, it is possible to achieve a speedup of 450.65x over a GPU, and reduce the energy by 150.31x on average for a 64-chip system.
ISAAC: A Convolutional Neural Network Accelerator with In-Situ Analog Arithmetic in Crossbars
This work explores an in-situ processing approach, where memristor crossbar arrays not only store input weights, but are also used to perform dot-product operations in an analog manner.
Learning both Weights and Connections for Efficient Neural Network
This work presents a method that reduces the storage and computation required by neural networks by an order of magnitude without affecting their accuracy: it learns only the important connections and prunes redundant ones using a three-step method.
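A minimal sketch of pruning followed by retraining is shown below, assuming magnitude is used to decide which connections are redundant and assuming plain gradient descent for the retraining step. The function names, the threshold, and the learning rate are illustrative assumptions, not the paper's settings.

```python
def prune_mask(weights, threshold):
    """Mark which connections survive pruning (True means keep)."""
    return [abs(w) >= threshold for w in weights]


def masked_sgd_step(weights, grads, mask, lr):
    """One retraining step in which pruned connections stay at zero.

    Only surviving weights receive the gradient update, so the sparsity
    pattern chosen at pruning time is preserved during retraining.
    """
    return [w - lr * g if keep else 0.0
            for w, g, keep in zip(weights, grads, mask)]
```

Keeping the mask fixed during retraining is what lets the remaining weights compensate for the removed connections without the network becoming dense again.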
ShiDianNao: Shifting vision processing closer to the sensor
This paper proposes an accelerator which is 60x more energy efficient than the previous state-of-the-art neural network accelerator, designed down to the layout at 65 nm, with a modest footprint and consuming only 320 mW, but still about 30x faster than high-end GPUs.
DianNao: a small-footprint high-throughput accelerator for ubiquitous machine-learning
This study designs an accelerator for large-scale CNNs and DNNs, with a special emphasis on the impact of memory on accelerator design, performance and energy, and shows that it is possible to design an accelerator with a high throughput, capable of performing 452 GOP/s in a small footprint.
Convolutional networks for fast, energy-efficient neuromorphic computing
This approach allows the algorithmic power of deep learning to be merged with the efficiency of neuromorphic processors, bringing the promise of embedded, intelligent, brain-inspired computing one step closer.
Going Deeper with Embedded FPGA Platform for Convolutional Neural Network
This paper presents an in-depth analysis of state-of-the-art CNN models, shows that convolutional layers are computation-centric and fully-connected layers are memory-centric, and proposes a CNN accelerator design on embedded FPGA for ImageNet large-scale image classification.
Deep learning with COTS HPC systems
This paper presents technical details and results from the authors' own system based on Commodity Off-The-Shelf High Performance Computing (COTS HPC) technology, a cluster of GPU servers with Infiniband interconnects and MPI, and shows that it can scale to networks with over 11 billion parameters using just 16 machines.