A Distributed Multi-GPU System for Fast Graph Processing

@article{Jia2017ADM,
  title={A Distributed Multi-GPU System for Fast Graph Processing},
  author={Zhihao Jia and Yongkee Kwon and Galen M. Shipman and Patrick S. McCormick and Mattan Erez and Alexander Aiken},
  journal={Proc. VLDB Endow.},
  year={2017},
  volume={11},
  pages={297--310}
}
We present Lux, a distributed multi-GPU system that achieves fast graph processing by exploiting the aggregate memory bandwidth of multiple GPUs and taking advantage of locality in the memory hierarchy of multi-GPU clusters. Lux provides two execution models that optimize algorithmic efficiency and enable important GPU optimizations, respectively. Lux also uses a novel dynamic load balancing strategy that is cheap and achieves good load balance across GPUs. In addition, we present a performance… 
MG-Join: A Scalable Join for Massively Parallel Multi-GPU Architectures
TLDR
This paper proposes MG-Join, a scalable partitioned hash join implementation on multiple GPUs of a single machine that outperforms state-of-the-art hash join implementations by up to 2.5x and improves the overall performance of TPC-H queries by up to 4.5x over the multi-GPU version of OmniSci, an open-source commercial GPU database.
A Study of Graph Analytics for Massive Datasets on Distributed Multi-GPUs
TLDR
This paper presents the first detailed analysis of graph analytics applications for massive real-world datasets on a distributed multi-GPU platform and the first analysis of strong scaling of smaller real-world datasets.
SIMD-X: Programming and Processing of Graph Algorithms on GPUs
TLDR
SIMD-X uses just-in-time task management, which filters out inactive vertices at runtime and maps tasks to different numbers of GPU cores to balance the workload, and leverages push-pull based kernel fusion, which reduces a large number of computation kernels to very few.
GPU-based Graph Traversal on Compressed Graphs
TLDR
This paper introduces GPU-based graph traversal on compressed graphs, designed for the GPU's SIMT architecture, and proposes two novel parallel scheduling strategies, Two-Phase Traversal and Task-Stealing, to handle thread divergence and workload imbalance when decoding the compressed graph.
Self-adaptive Graph Traversal on GPUs
TLDR
This paper introduces SAGE, a self-adaptive graph traversal method for GPUs that is free from preprocessing and operates directly on ubiquitous graph representations, and proposes Tiled Partitioning and Resident Tile Stealing to fully exploit the computing power of GPUs in a runtime, self-adaptive manner.
AsynGraph: Maximizing Data Parallelism for Efficient Iterative Graph Processing on GPUs
TLDR
This article develops a novel system, called AsynGraph, to maximize data parallelism: by efficiently handling the paths between the important vertices, it enables the state propagations of most vertices to be conducted concurrently on the GPUs, yielding a higher GPU utilization ratio.
Subway: minimizing data transfer during out-of-GPU-memory graph processing
TLDR
This work designs a fast subgraph generation algorithm with a simple yet efficient subgraph representation and a GPU-accelerated implementation, and brings asynchrony to the subgraph processing, delaying the synchronization between a subgraph in the GPU memory and the rest of the graph in the CPU memory.
DiGraph: An Efficient Path-based Iterative Directed Graph Processing System on Multiple GPUs
TLDR
A novel and efficient iterative directed graph processing system for a multi-GPU machine that takes advantage of the dependencies between vertices in three novel ways to propagate vertex state efficiently along paths across GPUs, for faster convergence and a higher utilization ratio of the loaded data.
Excavating the Potential of GPU for Accelerating Graph Traversal
TLDR
EtaGraph is a novel GPU graph traversal framework optimized for the GPU memory system and execution parallelism that uses a frontier-like kernel execution model, featuring a lightweight graph transformation procedure, named Unified Degree Cut, to process skewed graphs efficiently without modifying raw data or introducing extra space overhead.
An Adaptive Load Balancer For Graph Analytical Applications on GPUs
TLDR
This scheme is implemented in the IrGL compiler to allow users to generate efficient load-balanced code for a GPU from high-level sequential programs and can achieve an average speed-up of 2.2x on inputs that suffer from severe load imbalance problems when previous state-of-the-art load-balancing schemes are used.
...
...

References

GTS: A Fast and Scalable Graph Processing Method based on Streaming Topology to GPUs
TLDR
A fast and scalable graph processing method, GTS, is proposed that handles even RMAT32 (64 billion edges) very efficiently using only a single machine and consistently and significantly outperforms the major distributed graph processing methods, GraphX, Giraph, and PowerGraph, and the state-of-the-art GPU-based method TOTEM.
CuSha: vertex-centric graph processing on GPUs
TLDR
CuSha is a CUDA-based graph processing framework that overcomes the above obstacle via use of two novel graph representations: G-Shards and Concatenated Windows.
Scalable GPU graph traversal
TLDR
This work presents a BFS parallelization focused on fine-grained task management constructed from efficient prefix sum that achieves an asymptotically optimal O(|V|+|E|) work complexity.
MapGraph: A High Level API for Fast Development of High Performance Graph Analytics on GPUs
TLDR
MapGraph is presented, a high performance parallel graph programming framework that delivers up to 3 billion Traversed Edges Per Second on a GPU and is comparable to state-of-the-art, manually optimized GPU implementations.
Gunrock: a high-performance graph processing library on the GPU
TLDR
"Gunrock," the high-level bulk-synchronous graph-processing system targeting the GPU, takes a new approach to abstracting GPU graph analytics: rather than designing an abstraction around computation, Gunrock implements a novel data-centric abstraction centered on operations on a vertex or edge frontier.
Medusa: Simplified Graph Processing on GPUs
TLDR
This work proposes a programming framework called Medusa which enables developers to leverage the capabilities of GPUs by writing sequential C/C++ code and develops a series of graph-centric optimizations based on the architecture features of GPUs for efficiency.
PGX.D: a fast distributed graph processing engine
TLDR
This paper presents a fast distributed graph processing system, namely PGX.D, as a low-overhead, bandwidth-efficient communication framework that supports remote data-pulling patterns and recommends the use of balanced beefy clusters where the sustained random DRAM-access bandwidth in aggregate is matched with the bandwidth of the underlying interconnection fabric.
Groute: An Asynchronous Multi-GPU Programming Model for Irregular Computations
TLDR
It is demonstrated that this approach achieves state-of-the-art performance and exhibits strong scaling for a suite of irregular applications on 8-GPU and heterogeneous systems, yielding over 7x speedup for some algorithms.
MOCgraph: Scalable Distributed Graph Processing Using Message Online Computing
TLDR
This paper proposes MOCgraph, a scalable distributed graph processing framework to reduce the memory footprint and improve the scalability, based on message online computing, and implements it on top of Apache Giraph, and tests it against several representative graph algorithms.
Graph Analytics Through Fine-Grained Parallelism
TLDR
The topological properties of the underlying graph are explored to design and implement a highly effective concurrency control scheme for efficient synchronous processing in an in-memory graph analytical engine, and the results show that the proposed hybrid synchronous scheduler significantly outperforms other synchronous schedulers in existing graph analytical engines, as well as BSP and asynchronous schedulers.
...
...