Accelerating bandwidth-bound deep learning inference with main-memory accelerators

@article{Cho2021AcceleratingBD,
  title={Accelerating bandwidth-bound deep learning inference with main-memory accelerators},
  author={Benjamin Y. Cho and Jeageun Jung and Mattan Erez},
  journal={Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis},
  year={2021}
}
Matrix-matrix multiplication operations (GEMMs) are important in many HPC and machine-learning applications. They are often mapped to discrete accelerators (e.g., GPUs) to improve performance. However, we find that large tall/skinny and fat/short matrices benefit little from discrete acceleration and also do not perform well on a CPU. Such matrices are prevalent in important workloads, such as deep-learning inference within large-scale datacenters. We demonstrate the large potential of… 
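A back-of-the-envelope sketch of why such shapes are bandwidth-bound (an illustration under my own assumptions of fp32 operands and ideal caching, not code from the paper): a GEMM's arithmetic intensity, FLOPs per byte of memory traffic, collapses when one dimension is small, leaving an accelerator's compute units starved for data.

```python
# Hypothetical illustration: arithmetic intensity of C[m,n] = A[m,k] @ B[k,n],
# counting one read of A and B and one write of C (ideal caching assumed).
def gemm_arithmetic_intensity(m, n, k, bytes_per_elem=4):
    flops = 2 * m * n * k                                   # one FMA = 2 FLOPs
    bytes_moved = bytes_per_elem * (m * k + k * n + m * n)  # A + B + C traffic
    return flops / bytes_moved

print(gemm_arithmetic_intensity(4096, 4096, 4096))  # square: ~683 FLOPs/byte
print(gemm_arithmetic_intensity(4096, 8, 4096))     # tall/skinny: ~4 FLOPs/byte
```

At roughly 4 FLOPs/byte, memory bandwidth rather than compute throughput bounds performance on both CPUs and GPUs, which is the regime the paper targets with main-memory acceleration.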
Citations

EcoFlow: Efficient Convolutional Dataflows for Low-Power Neural Network Accelerators
TLDR
EcoFlow enables flexible and high-performance transpose and dilated convolutions on architectures that are otherwise optimized for CNN inference, and its dataflows are evaluated on CNN training and Generative Adversarial Network (GAN) training workloads.
SparseP: Towards Efficient Sparse Matrix Vector Multiplication on Real Processing-In-Memory Systems
TLDR
This work focuses on the development of near-bank PIM designs that tightly couple a PIM core with each DRAM bank, exploiting bank-level parallelism to expose the high on-chip memory bandwidth of standard DRAM to processors.
Demystifying BERT: Implications for Accelerator Design
TLDR
This work carefully profiles BERT training, identifies key algorithmic behaviors that merit attention in accelerator design, and proposes holistic solutions to optimize systems for BERT-like models.
Efficient Cache Utilization via Model-aware Data Placement for Recommendation Models
TLDR
It is argued that memory subsystems more amenable to residency controls stand a better chance of addressing the needs of emerging models. Of the two key components of these models, embedding tables and multi-layer perceptron layers, the locality of memory accesses to embedding tables is exploited to derive a more nuanced data placement scheme.
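The access skew that such placement schemes exploit is easy to see with a toy model (the Zipf-like lookup distribution and table size are my assumptions, not the paper's workload):

```python
# Hypothetical illustration: skewed embedding lookups mean a small pinned
# "hot" set of rows captures most accesses.
import random
from collections import Counter

random.seed(0)
TABLE_ROWS = 100_000
lookups = [min(int(random.paretovariate(1.2)), TABLE_ROWS - 1)
           for _ in range(100_000)]                            # heavy-tailed stream
hot = {row for row, _ in Counter(lookups).most_common(1_000)}  # pin top 1%
hit_rate = sum(l in hot for l in lookups) / len(lookups)
print(f"{hit_rate:.0%} of lookups hit the pinned 1% of rows")
```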
PIM-Enclave: Bringing Confidential Computation Inside Memory
TLDR
A novel design for Processing-In-Memory (PIM) as a data-intensive workload accelerator for confidential computing is presented, providing side-channel-resistant secure computation offloading and running data-intensive applications with negligible performance overhead compared to a baseline PIM model.

References

Showing 1–10 of 59 references
Newton: A DRAM-maker’s Accelerator-in-Memory (AiM) Architecture for Machine Learning
TLDR
This work focuses on digital PIM, which provides higher bandwidth than PNM and does not incur the reliability issues of analog PIM, and describes Newton, a major DRAM maker’s upcoming accelerator-in-memory (AiM) product for machine learning, which makes PIM feasible for the first time.
SIGMA: A Sparse and Irregular GEMM Accelerator with Flexible Interconnects for DNN Training
  • Eric Qin, A. Samajdar, T. Krishna
  • Computer Science
    2020 IEEE International Symposium on High Performance Computer Architecture (HPCA)
  • 2020
TLDR
SIGMA is proposed: a flexible and scalable architecture that offers high utilization of all its processing elements (PEs) regardless of kernel shape and sparsity, and that includes a novel reduction-tree microarchitecture named Forwarding Adder Network (FAN).
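Functionally, a FAN-style tree reduces variable-sized groups of partial sums without stalling on group boundaries; a minimal software sketch of that behavior (my simplification, not SIGMA's microarchitecture):

```python
# Hypothetical illustration: reduce adjacent partial sums that share a segment
# id, level by level; an unmerged sum is forwarded to the next level instead
# of stalling (the "forwarding" in Forwarding Adder Network).
def segmented_tree_reduce(values, seg_ids):
    items = list(zip(seg_ids, values))
    while len(items) > 1:
        nxt, i = [], 0
        while i < len(items):
            if i + 1 < len(items) and items[i][0] == items[i + 1][0]:
                nxt.append((items[i][0], items[i][1] + items[i + 1][1]))
                i += 2                 # merged a pair within one segment
            else:
                nxt.append(items[i])   # forward unmerged partial sum
                i += 1
        if nxt == items:               # every segment fully reduced
            break
        items = nxt
    return items

# Two dot-products of lengths 3 and 2 mapped onto five PEs:
print(segmented_tree_reduce([1, 2, 3, 10, 20], [0, 0, 0, 1, 1]))
# -> [(0, 6), (1, 30)]
```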
Kelp: QoS for Accelerated Machine Learning Systems
TLDR
Kelp, a software runtime that isolates high-priority accelerated ML tasks from memory resource interference, is designed, implemented, and evaluated with both production and artificial aggressor workloads.
iPIM: Programmable In-Memory Image Processing Accelerator Using Near-Bank Architecture
  • P. Gu, Xinfeng Xie, Yuan Xie
  • Computer Science
    2020 ACM/IEEE 47th Annual International Symposium on Computer Architecture (ISCA)
  • 2020
TLDR
This work proposes iPIM, the first programmable in-memory image processing accelerator using a near-bank architecture, introduces the SIMB (Single-Instruction-Multiple-Bank) ISA to enable flexible control flow and data access, and develops iPIM-aware compiler optimizations to improve performance.
Chameleon: Versatile and practical near-DRAM acceleration architecture for large memory systems
TLDR
It is demonstrated that data buffers in a load-reduced DIMM (LRDIMM), originally developed to support large memory systems for servers, are well suited for integrating near-DRAM accelerators, and Chameleon, an NDA architecture that can be realized without relying on 3D/2.5D-stacking technology, is proposed.
RecNMP: Accelerating Personalized Recommendation with Near-Memory Processing
  • Liu Ke, Udit Gupta, Xiaodong Wang
  • Computer Science
    2020 ACM/IEEE 47th Annual International Symposium on Computer Architecture (ISCA)
  • 2020
TLDR
RecNMP is proposed, which provides a scalable solution to improve system throughput, supports a broad range of sparse embedding models, and is specifically tailored to production environments with heavy co-location of operators on a single server.
PipeLayer: A Pipelined ReRAM-Based Accelerator for Deep Learning
TLDR
PipeLayer is presented, a ReRAM-based PIM accelerator for CNNs that supports both training and testing; it proposes a highly parallel design based on the notions of parallelism granularity and weight replication, which enables highly pipelined execution of both training and testing without introducing the stalls of prior work.
MViD: Sparse Matrix-Vector Multiplication in Mobile DRAM for Accelerating Recurrent Neural Networks
TLDR
A main-memory architecture called MViD is proposed, which performs MV-mul by placing MAC units inside DRAM banks, using a sparse matrix format and exploiting quantization for higher computational efficiency.
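Functionally, each bank's work amounts to a quantized sparse matrix-vector product; a minimal software sketch (the CSR layout and int8 weight scheme here are my assumptions, not MViD's exact format):

```python
# Hypothetical illustration: y = (scale * W_q) @ x with W_q in CSR form.
# In an MViD-like design, each DRAM bank would own a slice of rows and run
# this inner loop on its in-bank MAC units.
import numpy as np

def csr_quantized_mv(data_q, indices, indptr, x, scale):
    y = np.zeros(len(indptr) - 1, dtype=np.float32)
    for row in range(len(y)):
        acc = 0.0
        for j in range(indptr[row], indptr[row + 1]):
            acc += int(data_q[j]) * x[indices[j]]  # MAC with int8 weight
        y[row] = scale * acc                       # dequantize once per row
    return y
```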
TETRIS: Scalable and Efficient Neural Network Acceleration with 3D Memory
TLDR
The hardware architecture and the software scheduling and partitioning techniques for TETRIS, a scalable NN accelerator using 3D memory, are presented, and it is shown that despite the use of small SRAM buffers, the presence of 3D memory simplifies dataflow scheduling for NN computations.
Get Out of the Valley: Power-Efficient Address Mapping for GPUs
  • Yuxi Liu, Xia Zhao, L. Eeckhout
  • Computer Science
    2018 ACM/IEEE 45th Annual International Symposium on Computer Architecture (ISCA)
  • 2018
TLDR
An entropy analysis approach tailored to the highly concurrent memory request behavior of GPU-compute workloads is provided, and the Page Address Entropy (PAE) mapping scheme is proposed, which concentrates the entropy of the row, channel, and bank bits of the input address into the bank and channel bits of the output address.
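The entropy analysis itself is simple to sketch in software (a toy version of the idea; the paper's exact PAE construction may differ): measure how evenly each address bit toggles across the request stream, then XOR-fold high-entropy bits into the bank/channel field so concurrent requests spread across banks and channels.

```python
# Hypothetical illustration of per-bit address entropy and XOR folding.
import math
from collections import Counter

def bit_entropy(addresses, bit):
    counts = Counter((a >> bit) & 1 for a in addresses)
    n = len(addresses)
    return -sum(c / n * math.log2(c / n) for c in counts.values())

def xor_fold(addr, src_bits, dst_bits):
    # Fold a high-entropy source bit into each bank/channel bit by XOR.
    for s, d in zip(src_bits, dst_bits):
        addr ^= ((addr >> s) & 1) << d
    return addr

addrs = [0x1000 * i for i in range(256)]  # strided, GPU-like access stream
print(bit_entropy(addrs, 4))              # below the stride: 0.0 (dead bit)
print(bit_entropy(addrs, 12))             # toggles with the stride: 1.0
```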