GraphPIM: Enabling Instruction-Level PIM Offloading in Graph Computing Frameworks
@article{Nai2017GraphPIMEI,
  title   = {GraphPIM: Enabling Instruction-Level PIM Offloading in Graph Computing Frameworks},
  author  = {Lifeng Nai and Ramyad Hadidi and Jaewoong Sim and Hyojong Kim and Pranith Kumar and Hyesoon Kim},
  journal = {2017 IEEE International Symposium on High Performance Computer Architecture (HPCA)},
  year    = {2017},
  pages   = {457-468}
}
With the emergence of data science, graph computing has become increasingly important. Unfortunately, graph computing typically suffers from poor performance when mapped to modern computing systems because of the overhead of executing atomic operations and the inefficient utilization of the memory subsystem. Meanwhile, emerging technologies, such as the Hybrid Memory Cube (HMC), enable processing-in-memory (PIM) functionality by offloading operations at an instruction level…
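The host-side atomics that GraphPIM targets show up in nearly every graph kernel. The sketch below is illustrative only (a generic edge walk, not GraphPIM's code): the integer fetch-and-add on an effectively random vertex is usually a cache miss followed by a read-modify-write on the host core, and it is exactly the class of operation that in-memory atomic commands could execute on the memory side instead.

```cpp
#include <atomic>
#include <cstddef>
#include <cstdint>
#include <vector>

// Illustrative edge list; names are not from GraphPIM.
struct EdgeList {
    std::vector<uint32_t> src;
    std::vector<uint32_t> dst;
};

// Degree-centrality-style accumulation. Each fetch_add hits an
// effectively random vertex, so on the host it is typically a cache
// miss plus a read-modify-write. GraphPIM's idea is to map host atomic
// instructions like this one onto HMC atomic commands so the update
// happens near the data.
void count_in_degrees(const EdgeList& edges,
                      std::vector<std::atomic<uint32_t>>& in_degree) {
    for (std::size_t e = 0; e < edges.dst.size(); ++e) {
        in_degree[edges.dst[e]].fetch_add(1, std::memory_order_relaxed);
    }
}
```

HMC 2.0 atomic commands are integer and bitwise operations, which is one reason graph kernels built on integer read-modify-writes (degree counting, visit flags, histogram-style accumulation) are natural offload candidates.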
170 Citations
CoPIM: A Concurrency-aware PIM Workload Offloading Architecture for Graph Applications
- Computer Science2021 IEEE/ACM International Symposium on Low Power Electronics and Design (ISLPED)
- 2021
CoPIM, a novel PIM workload offloading architecture, is presented; it dynamically determines which portions of a graph workload benefit more from PIM-side computation, reducing the volume of offloaded instructions and improving overall performance with lower energy consumption.
GraphP: Reducing Communication for PIM-Based Graph Processing with Efficient Data Partition
- Computer Science2018 IEEE International Symposium on High Performance Computer Architecture (HPCA)
- 2018
It is argued that a PIM-based graph processing system should take data organization as a first-order design consideration, and GraphP, a novel HMC-based software/hardware co-designed graph processing system, is proposed that drastically reduces communication and energy consumption compared to Tesseract.
GraphQ: Scalable PIM-Based Graph Processing
- Computer ScienceMICRO
- 2019
GraphQ, an improved PIM-based graph processing architecture over the recent Tesseract architecture that fundamentally eliminates irregular data movements, is proposed, and it is shown that increasing memory size in PIM also proportionally increases compute capability.
GraphH: A Processing-in-Memory Architecture for Large-Scale Graph Processing
- Computer ScienceIEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems
- 2019
GraphH, a PIM architecture for graph processing on a hybrid memory cube array, is proposed to tackle four problems: random access patterns causing local bandwidth degradation, poor locality leading to unpredictable global data access, heavy conflicts on updating the same vertex, and unbalanced workloads across processing units.
Analysis and Optimization of the Memory Hierarchy for Graph Processing Workloads
- Computer Science2019 IEEE International Symposium on High Performance Computer Architecture (HPCA)
- 2019
An in-depth data-type-aware characterization of graph processing workloads on a simulated multi-core architecture finds that load-load dependency chains involving different application data types form the primary bottleneck to achieving high memory-level parallelism.
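The load-load dependency chain identified here is the indirect-access pattern sketched below (a generic example, not the paper's code): the property load cannot issue until the structure load that produces its index returns, so the misses serialize instead of overlapping.

```cpp
#include <cstdint>
#include <vector>

// Generic gather over a CSR neighbor list. The second load depends on
// the first: prop[] cannot be fetched until col_idx[] returns, so two
// potential cache misses are forced to serialize. Chains like this,
// spanning different data types (graph structure vs. vertex property),
// are what limits memory-level parallelism.
float gather_neighbors(const std::vector<uint32_t>& col_idx,
                       const std::vector<float>& prop,
                       uint32_t begin, uint32_t end) {
    float sum = 0.0f;
    for (uint32_t e = begin; e < end; ++e) {
        uint32_t nbr = col_idx[e];   // load 1: graph structure
        sum += prop[nbr];            // load 2: property, depends on load 1
    }
    return sum;
}
```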
Heterogeneous Memory Subsystem for Natural Graph Analytics
- Computer Science2018 IEEE International Symposium on Workload Characterization (IISWC)
- 2018
This work targets graphs that follow a power-law distribution, for which there is a unique opportunity to significantly boost the overall performance of the memory subsystem, and proposes a novel memory subsystem architecture that leverages this structural graph locality.
A Heterogeneous PIM Hardware-Software Co-Design for Energy-Efficient Graph Processing
- Computer Science2020 IEEE International Parallel and Distributed Processing Symposium (IPDPS)
- 2020
It is found that there is no absolute winner between the two representative PIM technologies evaluated for graph applications, which often exhibit irregular workloads, and a new heterogeneous PIM hardware, called Hetraph, is introduced to facilitate energy-efficient graph processing.
LCCG: a locality-centric hardware accelerator for high throughput of concurrent graph processing
- Computer ScienceSC
- 2021
This paper proposes LCCG, a locality-centric programmable accelerator that augments a many-core processor to achieve higher throughput for concurrent graph processing jobs, and incorporates a novel topology-aware execution approach into the accelerator design to regularize graph traversals for multiple jobs on the fly according to the graph topology.
Energy characterization of graph workloads
- Computer ScienceSustain. Comput. Informatics Syst.
- 2021
A Preliminary Study of Compiler Transformations for Graph Applications on the Emu System
- Computer ScienceMCHPC@SC
- 2018
This work explores two high-level compiler optimizations, loop fusion and edge flipping, and one low-level compiler transformation that leverages hardware support for remote atomic updates to address overheads arising from thread migration, creation, synchronization, and atomic operations.
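Loop fusion in this context is the standard transformation sketched below (an illustrative C++ example, not the Emu compiler's output): two sweeps over the vertex data, each of which would pay its own round of thread spawns and, on a migratory-thread machine, migrations, are merged into a single sweep. Edge flipping, roughly iterating over a transposed edge set so that updates become local rather than remote, is not shown.

```cpp
#include <cstddef>
#include <vector>

// Before fusion: two separate sweeps over the vertex properties,
// touching every vertex's data twice.
void update_unfused(std::vector<float>& rank, const std::vector<float>& delta) {
    for (std::size_t v = 0; v < rank.size(); ++v) rank[v] += delta[v];
    for (std::size_t v = 0; v < rank.size(); ++v) rank[v] *= 0.85f;
}

// After fusion: one sweep, touching each vertex's data once.
void update_fused(std::vector<float>& rank, const std::vector<float>& delta) {
    for (std::size_t v = 0; v < rank.size(); ++v)
        rank[v] = (rank[v] + delta[v]) * 0.85f;
}
```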
References
SHOWING 1-10 OF 56 REFERENCES
Instruction Offloading with HMC 2.0 Standard: A Case Study for Graph Traversals
- Computer ScienceMEMSYS
- 2015
A preliminary study of instruction offloading on HMC 2.0 using graph traversals as an example shows the feasibility of an instruction-level offloading PIM architecture and demonstrates the programmability and performance benefits.
GraphBIG: understanding graph computing in the context of industrial solutions
- Computer ScienceSC15: International Conference for High Performance Computing, Networking, Storage and Analysis
- 2015
This paper characterized GraphBIG on real machines and observed extremely irregular memory patterns and significantly diverse behavior across different computations, helping users understand the impact of modern graph computing on hardware architecture and enabling future architecture and system research.
PIM-enabled instructions: A low-overhead, locality-aware processing-in-memory architecture
- Computer Science2015 ACM/IEEE 42nd Annual International Symposium on Computer Architecture (ISCA)
- 2015
A new PIM architecture is proposed that does not change existing sequential programming models and automatically decides whether to execute PIM operations in memory or in processors depending on the locality of data, combining the best parts of conventional and PIM architectures by adapting to the data locality of applications.
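The locality-aware decision described above can be pictured with the hedged sketch below; every name in it is a hypothetical stand-in for a hardware mechanism, not the paper's actual interface. The point is only the dispatch rule: execute the operation on the host when its data is likely cached, and send it to memory otherwise.

```cpp
#include <atomic>
#include <cstdint>

// Placeholder locality check: in hardware this would consult the cache
// hierarchy or a locality monitor, not a software heuristic.
static bool likely_in_cache(const void* /*addr*/) { return true; }

// Stand-in for executing the operation through the host's caches.
static void host_atomic_add(std::atomic<uint64_t>& x, uint64_t v) {
    x.fetch_add(v, std::memory_order_relaxed);
}

// Stand-in for executing the operation on the memory side.
static void pim_atomic_add(std::atomic<uint64_t>& x, uint64_t v) {
    x.fetch_add(v, std::memory_order_relaxed);
}

// Locality-aware dispatch of one PIM-style atomic add:
// good locality -> run on the host and reuse the cached line;
// poor locality -> offload and avoid dragging the line on-chip.
void locality_aware_add(std::atomic<uint64_t>& x, uint64_t v) {
    if (likely_in_cache(&x)) host_atomic_add(x, v);
    else                     pim_atomic_add(x, v);
}
```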
NDC: Analyzing the impact of 3D-stacked memory+logic devices on MapReduce workloads
- Computer Science2014 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS)
- 2014
A number of key elements necessary in realizing efficient NDC operation are described and evaluated, including low-EPI cores, long daisy chains of memory devices, and the dynamic activation of cores and SerDes links.
A scalable processing-in-memory accelerator for parallel graph processing
- Computer Science2015 ACM/IEEE 42nd Annual International Symposium on Computer Architecture (ISCA)
- 2015
This work argues that the conventional concept of processing-in-memory (PIM) can be a viable solution to achieve memory-capacity-proportional performance and designs a programmable PIM accelerator for large-scale graph processing called Tesseract.
A Highly Efficient Runtime and Graph Library for Large Scale Graph Analytics
- Computer ScienceGRADES
- 2014
A novel graph processing system called System G Native Store is discussed, which allows for efficient graph data organization and processing on modern computing architectures, along with a runtime designed to exploit multiple levels of parallelism and a generic infrastructure that allows users to express graphs with various in-memory and persistent storage properties.
Mapping Irregular Applications to DIVA, a PIM-based Data-Intensive Architecture
- Computer ScienceACM/IEEE SC 1999 Conference (SC'99)
- 1999
The potential of PIM-based architectures in accelerating three irregular computations (sparse conjugate gradient, a natural-join database operation, and an object-oriented database query) is demonstrated.
Efficient Parallel Graph Exploration on Multi-Core CPU and GPU
- Computer Science2011 International Conference on Parallel Architectures and Compilation Techniques
- 2011
A new method for implementing the parallel BFS algorithm on multi-core CPUs is presented, which exploits a fundamental property of randomly shaped real-world graph instances; it shows improved performance over the current state-of-the-art implementation, and its advantage grows as the size of the graph increases.
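For context, the baseline such work improves on is the plain level-synchronous BFS sketched below (a generic sequential version, not the paper's multi-core CPU or GPU kernel); the contribution summarized above lies in how the frontier expansion is parallelized and adapted to the structure of real-world graphs.

```cpp
#include <cstdint>
#include <vector>

// Generic level-synchronous BFS over a CSR graph (baseline sketch).
// depth[v] == UINT32_MAX means "not yet visited".
std::vector<uint32_t> bfs(const std::vector<uint32_t>& row_ptr,
                          const std::vector<uint32_t>& col_idx,
                          uint32_t source) {
    const uint32_t n = static_cast<uint32_t>(row_ptr.size() - 1);
    std::vector<uint32_t> depth(n, UINT32_MAX);
    std::vector<uint32_t> frontier{source};
    depth[source] = 0;

    for (uint32_t level = 1; !frontier.empty(); ++level) {
        std::vector<uint32_t> next;
        for (uint32_t u : frontier) {
            for (uint32_t e = row_ptr[u]; e < row_ptr[u + 1]; ++e) {
                uint32_t v = col_idx[e];
                if (depth[v] == UINT32_MAX) {   // first visit wins
                    depth[v] = level;
                    next.push_back(v);
                }
            }
        }
        frontier.swap(next);   // next level becomes the new frontier
    }
    return depth;
}
```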
BSSync: Processing Near Memory for Machine Learning Workloads with Bounded Staleness Consistency Models
- Computer Science2015 International Conference on Parallel Architecture and Compilation (PACT)
- 2015
Bounded Staled Sync (BSSync), hardware support for the bounded staleness consistency model, is proposed; it adds simple logic layers to the memory hierarchy and targets iterative convergent machine learning workloads.
Practical Near-Data Processing for In-Memory Analytics Frameworks
- Computer Science2015 International Conference on Parallel Architecture and Compilation (PACT)
- 2015
This paper develops the hardware and software of an NDP architecture for in-memory analytics frameworks, including MapReduce, graph processing, and deep neural networks, and shows that it is critical to optimize software frameworks for spatial locality, as this leads to 2.9x efficiency improvements for NDP.