A Customized NoC Architecture to Enable Highly Localized Computing-on-the-Move DNN Dataflow

@article{Zhou2021ACN,
  title={A Customized NoC Architecture to Enable Highly Localized Computing-on-the-Move DNN Dataflow},
  author={Kaining Zhou and Yangshuo He and Rui Xiao and Jiayi Liu and Kejie Huang},
  journal={IEEE Transactions on Circuits and Systems II: Express Briefs},
  year={2021},
  volume={69},
  pages={1692--1696}
}
The ever-increasing computational complexity of fast-growing Deep Neural Networks (DNNs) demands new computing paradigms to overcome the memory wall of conventional von Neumann architectures. The emerging Computing-In-Memory (CIM) architecture is a promising candidate for accelerating neural network computation. However, data movement between CIM arrays may still dominate total power consumption in conventional designs. This brief proposes a flexible CIM processor…


In-Network Accumulation: Extending the Role of NoC for DNN Acceleration

The In-Network Accumulation (INA) method is proposed to further accelerate DNN workload execution on a many-core spatial DNN accelerator under the Weight Stationary (WS) dataflow model by expanding the router's function to support partial-sum accumulation.

References

Showing 1–10 of 17 references

A Programmable Neural-Network Inference Accelerator Based on Scalable In-Memory Computing

This paper presents a scalable neural-network inference accelerator in 16nm, based on an array of programmable cores employing mixed-signal in-memory computing, digital near-memory computing, and localized buffering/control to overcome the overheads of hardware virtualization.

Cycle-Accurate Network on Chip Simulation with Noxim

Noxim, an open, configurable, extensible, cycle-accurate NoC simulator developed in SystemC, is presented; it allows analysis of the performance and power figures of both conventional wired NoC and emerging WiNoC architectures.

An 89TOPS/W and 16.3TOPS/mm2 All-Digital SRAM-Based Full-Precision Compute-In Memory Macro in 22nm for Machine-Learning Edge Applications

  • Y. Chih, Po-Hao Lee, T. Chang
  • Computer Science
  • 2021 IEEE International Solid-State Circuits Conference (ISSCC)
  • 2021
CIM research has focused on more analog approaches for their high energy efficiency; however, low SNR limits their accuracy, so an analog approach may not be suitable for applications that require high accuracy.

CASCADE: Connecting RRAMs to Extend Analog Dataflow In An End-To-End In-Memory Processing Paradigm

This work demonstrates the CASCADE architecture, which connects multiply-accumulate RRAM arrays with buffer RRAM arrays to extend processing in analog and in memory: dot products are followed by partial-sum buffering and accumulation to implement a complete DNN or RNN layer.

29.1 A 40nm 64Kb 56.67TOPS/W Read-Disturb-Tolerant Compute-in-Memory/Digital RRAM Macro with Active-Feedback-Based Read and In-Situ Write Verification

A 64Kb RRAM macro is presented that supports a programmable number of row accesses to enable vector-matrix multiplication at a target algorithm-level inference accuracy, along with in-situ write verification to achieve a tight resistance distribution.

14.3 A 65nm Computing-in-Memory-Based CNN Processor with 2.9-to-35.8TOPS/W System Energy Efficiency Using Dynamic-Sparsity Performance-Scaling Architecture and Energy-Efficient Inter/Intra-Macro Data Reuse

For a high compression rate and high efficiency, the granularity of sparsity needs to be explored based on CIM characteristics; system-level weight mapping to CIM macros and data-reuse strategies also remain underexplored, and both are important for CIM macro utilization and energy efficiency.

Eyeriss v2: A Flexible Accelerator for Emerging Deep Neural Networks on Mobile Devices

Eyeriss v2, a DNN accelerator architecture designed for running compact and sparse DNNs, is presented; it introduces a highly flexible on-chip network that adapts to the different amounts of data reuse and bandwidth requirements of different data types, improving the utilization of computation resources.

AtomLayer: A Universal ReRAM-Based CNN Accelerator with Atomic Layer Computation

AtomLayer, a universal ReRAM-based accelerator supporting both efficient CNN training and inference, is proposed; it achieves higher power efficiency than ISAAC in inference and PipeLayer in training while reducing the footprint by 15×.

Eyeriss: a spatial architecture for energy-efficient dataflow for convolutional neural networks

A novel dataflow, called row stationary (RS), is presented that minimizes data-movement energy consumption on a spatial architecture; it adapts to different CNN shape configurations and reduces all types of data movement by maximally utilizing processing-engine local storage, direct inter-PE communication, and spatial parallelism.

Scaling equations for the accurate prediction of CMOS device performance from 180 nm to 7 nm