GraphPIM: Enabling Instruction-Level PIM Offloading in Graph Computing Frameworks

@inproceedings{Nai2017GraphPIMEI,
  title={GraphPIM: Enabling Instruction-Level PIM Offloading in Graph Computing Frameworks},
  author={Lifeng Nai and Ramyad Hadidi and Jaewoong Sim and Hyojong Kim and Pranith Kumar and Hyesoon Kim},
  booktitle={2017 IEEE International Symposium on High Performance Computer Architecture (HPCA)},
  year={2017},
  pages={457--468}
}
  • Published 1 February 2017
With the emergence of data science, graph computing has become increasingly important these days. Unfortunately, graph computing typically suffers from poor performance when mapped to modern computing systems because of the overhead of executing atomic operations and inefficient utilization of the memory subsystem. Meanwhile, emerging technologies, such as Hybrid Memory Cube (HMC), enable the processing-in-memory (PIM) functionality with offloading operations at an instruction level… 
CoPIM: A Concurrency-aware PIM Workload Offloading Architecture for Graph Applications
TLDR
CoPIM, a novel PIM workload offloading architecture, is presented that can dynamically determine which portions of the graph workload benefit more from PIM-side computation; it reduces the size of offloading instructions and improves overall performance with lower energy consumption.
GraphP: Reducing Communication for PIM-Based Graph Processing with Efficient Data Partition
TLDR
It is argued that a PIM-based graph processing system should take data organization as a first-order design consideration, and GraphP, a novel HMC-based software/hardware co-designed graph processing system, is proposed that drastically reduces communication and energy consumption compared to Tesseract.
GraphQ: Scalable PIM-Based Graph Processing
TLDR
GraphQ, an improved PIM-based graph processing architecture over the recent Tesseract architecture, is proposed that fundamentally eliminates irregular data movements, and it is shown that increasing memory size in PIM also proportionally increases compute capability.
GraphH: A Processing-in-Memory Architecture for Large-Scale Graph Processing
TLDR
GraphH, a PIM architecture for graph processing on a hybrid memory cube array, is proposed to tackle four problems: random access patterns causing local bandwidth degradation, poor locality leading to unpredictable global data accesses, heavy conflicts on updating the same vertex, and unbalanced workloads across processing units.
Analysis and Optimization of the Memory Hierarchy for Graph Processing Workloads
TLDR
An in-depth, data-type-aware characterization of graph processing workloads on a simulated multi-core architecture finds that load-load dependency chains involving different application data types form the primary bottleneck in achieving high memory-level parallelism.
Heterogeneous Memory Subsystem for Natural Graph Analytics
TLDR
This work targets graphs that follow a power-law distribution, for which there is a unique opportunity to significantly boost the overall performance of the memory subsystem, and proposes a novel memory subsystem architecture that leverages this structural graph locality.
A Heterogeneous PIM Hardware-Software Co-Design for Energy-Efficient Graph Processing
TLDR
It is found that there is no absolute winner between the two representative PIM technologies for graph applications, which often exhibit irregular workloads, and a new heterogeneous PIM hardware, called Hetraph, is introduced to facilitate energy-efficient graph processing.
LCCG: a locality-centric hardware accelerator for high throughput of concurrent graph processing
TLDR
This paper proposes LCCG, a locality-centric programmable accelerator that augments many-core processors for higher throughput of concurrent graph processing jobs, and develops a novel topology-aware execution approach in the accelerator design that regularizes the graph traversals of multiple jobs on-the-fly according to the graph topology.
Energy characterization of graph workloads
A Preliminary Study of Compiler Transformations for Graph Applications on the Emu System
TLDR
Two high-level compiler optimizations, i.e., loop fusion and edge flipping, and one low-level compiler transformation leveraging hardware support for remote atomic updates are explored to address overheads arising from thread migration, creation, synchronization, and atomic operations.

References

Showing 1-10 of 56 references
Instruction Offloading with HMC 2.0 Standard: A Case Study for Graph Traversals
TLDR
A preliminary study of instruction offloading on HMC 2.0 using graph traversals as an example shows the feasibility of an instruction-level offloading PIM architecture and demonstrates the programmability and performance benefits.
GraphBIG: understanding graph computing in the context of industrial solutions
TLDR
This paper characterizes GraphBIG on real machines and observes extremely irregular memory patterns and significantly diverse behavior across different computations, helping users understand the impact of modern graph computing on the hardware architecture and enabling future architecture and system research.
PIM-enabled instructions: A low-overhead, locality-aware processing-in-memory architecture
TLDR
A new PIM architecture is proposed that does not change the existing sequential programming model and automatically decides whether to execute PIM operations in memory or in processors depending on data locality; it combines the best parts of conventional and PIM architectures by adapting to the data locality of applications.
NDC: Analyzing the impact of 3D-stacked memory+logic devices on MapReduce workloads
TLDR
A number of key elements necessary in realizing efficient NDC operation are described and evaluated, including low-EPI cores, long daisy chains of memory devices, and the dynamic activation of cores and SerDes links.
A scalable processing-in-memory accelerator for parallel graph processing
TLDR
This work argues that the conventional concept of processing-in-memory (PIM) can be a viable solution to achieve memory-capacity-proportional performance and designs a programmable PIM accelerator for large-scale graph processing called Tesseract.
A Highly Efficient Runtime and Graph Library for Large Scale Graph Analytics
TLDR
A novel graph processing system called System G Native Store is discussed, which allows for efficient graph data organization and processing on modern computing architectures, along with a runtime designed to exploit multiple levels of parallelism and a generic infrastructure that lets users express graphs with various in-memory and persistent storage properties.
Mapping Irregular Applications to DIVA, a PIM-based Data-Intensive Architecture
TLDR
The potential of PIM-based architectures is demonstrated in accelerating three irregular computations: sparse conjugate gradient, a natural-join database operation, and an object-oriented database query.
Efficient Parallel Graph Exploration on Multi-Core CPU and GPU
TLDR
A new method for implementing the parallel BFS algorithm on multi-core CPUs which exploits a fundamental property of randomly shaped real-world graph instances and shows improved performance over the current state-of-the-art implementation and increases its advantage as the size of the graph increases.
BSSync: Processing Near Memory for Machine Learning Workloads with Bounded Staleness Consistency Models
TLDR
Bounded Staled Sync (BSSync) is proposed, a hardware support for the bounded staleness consistency model, which accompanies simple logic layers in the memory hierarchy, targeting iterative convergent machine learning workloads.
Practical Near-Data Processing for In-Memory Analytics Frameworks
TLDR
This paper develops the hardware and software of an NDP architecture for in-memory analytics frameworks, including MapReduce, graph processing, and deep neural networks, and shows that it is critical to optimize software frameworks for spatial locality, as this leads to 2.9x efficiency improvements for NDP.