C-SAW: A Framework for Graph Sampling and Random Walk on GPUs

  • Santosh Pandey, Lingda Li, Adolfy Hoisie, Xiaoye S. Li, Hang Liu
  • Published 18 September 2020
  • Computer Science
  • SC20: International Conference for High Performance Computing, Networking, Storage and Analysis
Many applications need to learn, mine, analyze, and visualize large-scale graphs. These graphs are often too large to be handled efficiently by conventional graph processing technologies. Fortunately, recent research has found that graph sampling and random walk, which significantly reduce the size of the original graphs, can benefit the tasks of learning, mining, analyzing, and visualizing large graphs by capturing the desirable graph properties. This paper introduces C-SAW, the first… 
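
As a rough illustration of how a random walk reduces a graph to a smaller sample, the following minimal Python sketch runs unbiased walks from a few seed vertices and keeps only the edges they traverse. It is a sequential CPU sketch, not C-SAW's API or its GPU implementation; the names `random_walk`, `sample_subgraph`, and the toy adjacency list `adj` are hypothetical.

```python
import random

def random_walk(adj, start, length, rng):
    """Return an unbiased walk of at most `length` steps starting at `start`."""
    walk = [start]
    current = start
    for _ in range(length):
        neighbors = adj.get(current, [])
        if not neighbors:  # dead end: stop the walk early
            break
        current = rng.choice(neighbors)  # uniform choice among out-neighbors
        walk.append(current)
    return walk

def sample_subgraph(adj, seeds, length, seed=0):
    """Collect the set of edges traversed by one walk per seed vertex."""
    rng = random.Random(seed)
    edges = set()
    for s in seeds:
        walk = random_walk(adj, s, length, rng)
        edges.update(zip(walk, walk[1:]))  # consecutive vertices form an edge
    return edges

# Hypothetical toy graph: vertex -> list of out-neighbors.
adj = {0: [1, 2], 1: [2], 2: [0, 3], 3: []}
sampled = sample_subgraph(adj, seeds=[0, 1], length=4)
```

Every sampled edge is an edge of the original graph, so the result is a subgraph whose size is bounded by the number of seeds times the walk length.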

Efficient Data Loader for Fast Sampling-Based GNN Training on Large Graphs

PaGraph is a novel, efficient data loader that supports general and efficient sampling-based GNN training on a single server with multiple GPUs. It embodies a lightweight yet effective caching policy that simultaneously takes into account graph structural information and the data access patterns of sampling-based GNN training.

Trust: Triangle Counting Reloaded on GPUs

Trust is the first work that achieves over one trillion Traversed Edges Per Second (TEPS) rate for triangle counting, and advocates that hashing can help the key operations for scalable triangle counting on Graphics Processing Units (GPUs).
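
To show why hashing helps triangle counting, here is a minimal Python sketch in which each vertex's neighbor list is stored as a hash set, so the neighbor-list intersection becomes a series of constant-time membership probes. This is a generic sequential sketch under the assumption of an undirected edge list with integer vertex ids; it is not Trust's GPU algorithm, and `count_triangles` is a hypothetical name.

```python
def count_triangles(edges):
    """Count triangles in an undirected graph given as (u, v) pairs."""
    # Orient every edge from the smaller id to the larger id so that each
    # triangle a < b < c is found exactly once, from the edge (a, b).
    out = {}
    for u, v in edges:
        if u == v:  # ignore self-loops
            continue
        lo, hi = (u, v) if u < v else (v, u)
        out.setdefault(lo, set()).add(hi)  # set() removes duplicate edges
        out.setdefault(hi, set())
    total = 0
    for u, nbrs in out.items():
        for v in nbrs:
            # Probe the larger hash set with elements of the smaller one.
            if len(nbrs) <= len(out[v]):
                small, large = nbrs, out[v]
            else:
                small, large = out[v], nbrs
            total += sum(1 for w in small if w in large)
    return total

# Complete graph on four vertices, which contains four triangles.
k4 = [(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3)]
print(count_triangles(k4))  # prints 4
```

Probing the larger set with the elements of the smaller one keeps the work per edge proportional to the shorter neighbor list.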

gIM: GPU Accelerated RIS-Based Influence Maximization Algorithm

This article presents a novel and efficient parallel implementation of a RIS-based algorithm, namely IMM, on GPU, which can significantly reduce the running time on large-scale graphs with low values of $\epsilon$.

ThunderRW: An In-Memory Graph Random Walk Engine

Experimental results show that ThunderRW outperforms state-of-the-art approaches by an order of magnitude, and the step interleaving technique significantly reduces the CPU pipeline stall from 73.1% to 15.0%.

DSP: Efficient GNN Training with Multiple GPUs

This work proposes a system dubbed Distributed Sampling and Pipelining (DSP) for multi-GPU GNN training. DSP adopts a tailored data layout that stores the graph topology and popular node features in GPU memory to utilize the fast NVLink connections among the GPUs.

Mining User-aware Multi-relations for Fake News Detection in Large Scale Online Social Networks

A dual-layer graph is constructed to extract multi-relations of news and users in social networks and derive rich information for detecting fake news, and the superiority of Us-DeFake, which outperforms all baselines, is illustrated.

Scalable Deep Learning-Based Microarchitecture Simulation on GPUs

This work proposes the first graphics processing unit (GPU)-based microarchitecture simulator that fully unleashes the power of GPUs to accelerate state-of-the-art ML-based simulators, and proposes a parallel simulation paradigm that partitions the application trace into sub-traces to simulate them in parallel with rigorous error analysis and effective error correction mechanisms.

T-GCN: A Sampling Based Streaming Graph Neural Network System with Hybrid Architecture

T-GCN is proposed, the first sampling-based streaming GNN system, which targets temporal-aware streaming graphs and takes advantage of a hybrid CPU-GPU co-processing architecture to achieve high throughput and low latency and an NVLink-specific task schedule to fully exploit NVLink's fast speed and improve GPU-GPU communication efficiency.

Distributed Graph Neural Network Training: A Survey

A new taxonomy is proposed for the optimization techniques that address the challenges of distributed GNN training, classifying existing techniques into four categories: GNN data partition, GNN batch generation, GNN execution model, and GNN communication protocol.

Scalable Graph Sampling on GPUs with Compressed Graph

A Chunk-wise Graph Compression format (CGC) is introduced to effectively reduce the graph size and save the graph transfer cost and a scalable GPU-based graph sampling framework GraSS is developed and evaluated to demonstrate the efficiency and scalability of GraSS on both real-world and synthetic graphs.



Traversing large graphs on GPUs with unified memory

A lightweight offline graph reordering algorithm, HALO (Harmonic Locality Ordering), is proposed that can be used as a pre-processing step for static graphs and specifically aims to cover large directed real world graphs in addition to undirected graphs whereas prior methods only account for the latter.

Enterprise: breadth-first graph traversal on GPUs

  • Hang Liu, H. H. Huang
  • Computer Science
  • SC15: International Conference for High Performance Computing, Networking, Storage and Analysis
  • 2015
Enterprise is presented, a new GPU-based BFS system that combines three techniques to remove potential performance bottlenecks and is optimized for both top-down and bottom-up BFS.

Parallel edge-based sampling for static and dynamic graphs

A lightweight concurrent hash table coupled with a space-efficient dynamic graph data structure to overcome the challenges and memory constraints of sampling streaming dynamic graphs.

Gunrock: a high-performance graph processing library on the GPU

"Gunrock," the high-level bulk-synchronous graph-processing system targeting the GPU, takes a new approach to abstracting GPU graph analytics: rather than designing an abstraction around computation, Gunrock implements a novel data-centric abstraction centered on operations on a vertex or edge frontier.

Graphie: Large-Scale Asynchronous Graph Traversals on Just a GPU

Graphie is a system that efficiently traverses large-scale graphs on a single GPU: it stores the vertex attribute data in GPU memory, streams edge data asynchronously to the GPU for processing, and relies on two renaming algorithms for high performance.

SIMD-X: Programming and Processing of Graph Algorithms on GPUs

SIMD-X utilizes just-in-time task management which filters out inactive vertices at runtime and intelligently maps various tasks to different amount of GPU cores in pursuit of workload balancing, and leverages push-pull based kernel fusion that reduces a large number of computation kernels to very few.

Power-efficient and highly scalable parallel graph sampling using FPGAs

This paper presents the design and implementation of an FPGA-based graph sampling method that allows time- and memory-efficient representation of graphs suitable for reconfigurable hardware such as FPGAs, and shows that the proposed techniques are 2x faster and 3x more energy efficient than a serial CPU version of the algorithm.

GraphMat: High performance graph analytics made productive

GraphMat is a single-node multicore graph framework written in C++ that achieves better multicore scalability than other frameworks and is 1.2X off native, hand-optimized code on a variety of graph algorithms.

Fast Random Walk with Restart and Its Applications

The heart of the approach is to exploit two important properties shared by many real graphs: linear correlations and block-wise, community-like structure. The linearity is exploited by low-rank matrix approximation, and the community structure by graph partitioning, followed by the Sherman-Morrison lemma for matrix inversion.
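
For context, the quantity being accelerated can be sketched with a plain power iteration. Note that this is the simple baseline, not the low-rank and partitioning approach summarized above. The sketch assumes an adjacency-list graph, takes `c` as the probability of continuing the walk (so the walker restarts at the seed with probability `1 - c`; conventions vary between papers), and sends the mass of vertices without out-neighbors back to the seed. The name `rwr` and the toy graph are hypothetical.

```python
def rwr(adj, seed, c=0.9, iters=100):
    """Approximate random-walk-with-restart scores by power iteration."""
    nodes = sorted(adj)
    r = {v: 0.0 for v in nodes}
    r[seed] = 1.0  # the walker starts at the seed vertex
    for _ in range(iters):
        nxt = {v: 0.0 for v in nodes}
        for u in nodes:
            nbrs = adj[u]
            if not nbrs:
                # Assumption: a vertex with no out-neighbors restarts.
                nxt[seed] += c * r[u]
                continue
            share = c * r[u] / len(nbrs)  # spread mass evenly over neighbors
            for v in nbrs:
                nxt[v] += share
        nxt[seed] += 1.0 - c  # restart mass returns to the seed
        r = nxt
    return r

# Hypothetical undirected path graph 0 - 1 - 2, queried from vertex 0.
path = {0: [1], 1: [0, 2], 2: [1]}
scores = rwr(path, seed=0)
```

Each iteration touches every edge once, which is the cost the low-rank and partitioning techniques aim to avoid at query time.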

Accurate, Efficient and Scalable Graph Embedding

This paper proposes novel parallelization techniques for graph sampling-based GCNs that achieve superior scalable performance on very large graphs without compromising accuracy, and demonstrates that the parallel graph embedding outperforms state-of-the-art methods in scalability, efficiency, and accuracy on several large datasets.