Synchronization Trade-Offs in GPU Implementations of Graph Algorithms

@article{Kaleem2016SynchronizationTI,
  title={Synchronization Trade-Offs in GPU Implementations of Graph Algorithms},
  author={Rashid Kaleem and Anand Venkat and Sreepathi Pai and Mary W. Hall and Keshav Pingali},
  journal={2016 IEEE International Parallel and Distributed Processing Symposium (IPDPS)},
  year={2016},
  pages={514-523}
}
  • Rashid Kaleem, Anand Venkat, Sreepathi Pai, Mary W. Hall, Keshav Pingali
  • Published 23 May 2016
  • Computer Science
  • 2016 IEEE International Parallel and Distributed Processing Symposium (IPDPS)
Although there is an extensive literature on GPU implementations of graph algorithms, we do not yet have a clear understanding of how implementation choices impact performance. As a step towards this goal, we studied how the choice of synchronization mechanism affects the end-to-end performance of complex graph algorithms, using stochastic gradient descent (SGD) as an exemplar. We implemented seven synchronization strategies for this application and evaluated them on two GPU platforms, using… 
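The "simple locking" end of the synchronization spectrum the abstract alludes to can be illustrated with a minimal CPU-side sketch (all names and constants are illustrative, not the paper's implementation): SGD for matrix factorization over a rating graph, where each edge update touches one user row and one item row, so concurrent edge updates that share a row must be synchronized, here with per-row locks.

```python
import threading
import random

# Toy SGD matrix factorization: each edge (u, i, rating) updates user
# factor row u and item factor row i. Edge updates that share a row
# conflict, so rows are guarded with locks (the "simple locking" end
# of the synchronization-strategy spectrum).

K, LR = 4, 0.05
users = [[0.1] * K for _ in range(3)]
items = [[0.1] * K for _ in range(3)]
user_locks = [threading.Lock() for _ in users]
item_locks = [threading.Lock() for _ in items]

def sgd_edge(u, i, rating):
    # Acquire both row locks; a fixed acquisition order (user before
    # item) avoids deadlock under concurrent workers.
    with user_locks[u], item_locks[i]:
        err = rating - sum(a * b for a, b in zip(users[u], items[i]))
        for k in range(K):
            du = LR * err * items[i][k]
            di = LR * err * users[u][k]
            users[u][k] += du
            items[i][k] += di

edges = [(0, 0, 5.0), (0, 1, 3.0), (1, 0, 4.0), (2, 2, 1.0)]
for _ in range(200):                       # run serially here; the locks
    for e in random.sample(edges, len(edges)):  # matter under parallel workers
        sgd_edge(*e)

pred = sum(a * b for a, b in zip(users[0], items[0]))
```

Under parallel workers the locks prevent lost updates at the cost of contention; the paper's point is that this cost, versus lock-free or conflict-free alternatives, dominates end-to-end performance.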


Accelerating Dynamic Graph Analytics on GPUs

This paper proposes a GPU-based dynamic graph storage scheme to support existing graph algorithms easily and proposes parallel update algorithms to support efficient stream updates so that the maintained graph is immediately available for high-speed analytic processing on GPUs.

Technical Report: Accelerating Dynamic Graph Analytics on GPUs

This paper proposes a GPU-based dynamic graph storage scheme to support existing graph algorithms easily and proposes parallel update algorithms to support efficient stream updates so that the maintained graph is immediately available for high-speed analytic processing on GPUs.

A pattern based algorithmic autotuner for graph processing on GPUs

Gswitch is a pattern-based algorithmic auto-tuning system that dynamically switches between optimization variants with negligible overhead and provides a simple programming interface that conceals low-level tuning details from the user.

SEP-graph: finding shortest execution paths for graph processing under a hybrid framework on GPU

SEP-Graph, a highly efficient software framework for graph processing on GPUs, is presented; its hybrid execution mode automatically switches among three pairs of parameters with the objective of achieving the shortest execution time in each iteration.
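Per-iteration switching of this kind can be sketched with one of its classic instances, push/pull (direction-optimizing) traversal; the threshold and names below are illustrative, not SEP-Graph's actual policy.

```python
def bfs_hybrid(adj, radj, src):
    # Per-iteration push/pull switching: "push" (scan the frontier's
    # out-edges) when the frontier is small, "pull" (every unvisited
    # vertex scans its in-edges for a visited parent) when it is
    # large. adj/radj are forward/reverse adjacency lists.
    n = len(adj)
    dist = [-1] * n
    dist[src] = 0
    frontier, level = [src], 0
    while frontier:
        level += 1
        nxt = []
        if len(frontier) * 4 < n:           # push mode (small frontier)
            for u in frontier:
                for v in adj[u]:
                    if dist[v] == -1:
                        dist[v] = level
                        nxt.append(v)
        else:                               # pull mode (large frontier)
            fset = set(frontier)
            for v in range(n):
                if dist[v] == -1 and any(u in fset for u in radj[v]):
                    dist[v] = level
                    nxt.append(v)
        frontier = nxt
    return dist
```

Frameworks like SEP-Graph generalize this idea: the switch is re-evaluated every iteration, over several execution-mode dimensions rather than just traversal direction.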

Graphie: Large-Scale Asynchronous Graph Traversals on Just a GPU

Graphie is a system that efficiently traverses large-scale graphs on a single GPU: it stores vertex attribute data in GPU memory, streams edge data asynchronously to the GPU for processing, and relies on two renaming algorithms for high performance.

Specializing Coherence, Consistency, and Push/Pull for GPU Graph Analytics

This work explores the interaction of three communication-centric design dimensions for graph workloads on emerging integrated CPU-GPU systems: update propagation with and without fine-grained synchronization, emerging coherence protocols, and software-centric consistency models.

Energy-Efficient GPU Graph Processing with On-Demand Page Migration

This paper presents a new approach to extracting improved performance-per-watt on large-scale hybrid graph applications with sparse data access patterns, and introduces, two new code transformations, kernel blocking and compute colocation, to exploit page-level locality in host-resident data.

Cooperative kernels: GPU multitasking for blocking algorithms

This work describes a prototype implementation of a cooperative kernel framework implemented in OpenCL 2.0 and evaluates the approach by porting a set of blocking GPU applications to cooperative kernels and examining their performance under multitasking.

Cooperative Kernels: GPU Multitasking for Blocking Algorithms (Extended Version)

This work describes a prototype implementation of a cooperative kernel framework implemented in OpenCL 2.0 and evaluates the approach by porting a set of blocking GPU applications to cooperative kernels and examining their performance under multitasking.

References

Showing 1-10 of 28 references

Stochastic gradient descent on GPUs

This work examines several synchronization strategies for SGD, ranging from simple locking to conflict-free scheduling, and finds that the best schedule for some problems can be up to two orders of magnitude faster than the worst one.
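The conflict-free end of that spectrum amounts to scheduling by edge coloring: edges assigned to the same round share no endpoint, so a round can run fully in parallel with no locks or atomics. A minimal greedy sketch of such a schedule (illustrative, not the paper's scheduler):

```python
def edge_color(edges):
    # Greedy conflict-free schedule: give each edge the smallest
    # "round" number not already used by an earlier edge sharing one
    # of its endpoints. Edges in the same round touch disjoint
    # vertices, so their SGD updates cannot conflict.
    used = {}    # vertex -> set of rounds already taken at that vertex
    rounds = {}  # edge -> assigned round
    for e in edges:
        u, v = e
        taken = used.setdefault(u, set()) | used.setdefault(v, set())
        c = 0
        while c in taken:
            c += 1
        rounds[e] = c
        used[u].add(c)
        used[v].add(c)
    return rounds
```

The trade-off the paper measures: coloring costs preprocessing time and may yield many small rounds, but each round needs zero synchronization, which is why the best schedule can beat the worst by orders of magnitude.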

Scalable GPU graph traversal

This work presents a BFS parallelization focused on fine-grained task management constructed from efficient prefix sum that achieves an asymptotically optimal O(|V|+|E|) work complexity.
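The prefix-sum trick at the heart of that parallelization can be shown in a few lines: an exclusive scan over the frontier's out-degrees tells each vertex exactly where to write its neighbors in a shared output array, so no thread needs atomics to claim a slot. A serial sketch of the parallel pattern (names illustrative):

```python
from itertools import accumulate

def expand_frontier(adj, frontier):
    # Exclusive prefix sum over out-degrees assigns each frontier
    # vertex a contiguous output range; on a GPU the inner loops
    # become parallel-for gathers. Deduplication of the output is
    # left to a later phase, as in work-efficient BFS.
    degs = [len(adj[u]) for u in frontier]
    offsets = [0] + list(accumulate(degs))   # exclusive scan + total
    out = [None] * offsets[-1]
    for i, u in enumerate(frontier):         # parallel over frontier
        for j, v in enumerate(adj[u]):
            out[offsets[i] + j] = v
    return out
```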

A GPU implementation of inclusion-based points-to analysis

This paper describes a high-performance GPU implementation of an important graph algorithm used in compilers such as gcc and LLVM: Andersen-style inclusion-based points-to analysis, which achieves an average speedup of 7x compared to a sequential CPU implementation and outperforms a parallel implementation of the same algorithm running on 16 CPU cores.

Accelerating Large Graph Algorithms on the GPU Using CUDA

This work presents a few fundamental algorithms - including breadth first search, single source shortest path, and all-pairs shortest path - using CUDA on large graphs using the G80 line of Nvidia GPUs.

Fast minimum spanning tree for large graphs on the GPU

This paper presents a minimum spanning tree algorithm on Nvidia GPUs under CUDA, as a recursive formulation of Borůvka's approach for undirected graphs, implemented using scalable primitives such as scan, segmented scan and split.
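Borůvka's recursion, which the GPU version expresses through scan, segmented scan, and split, can be sketched serially with a union-find (an illustrative sketch of the algorithm, not the paper's implementation):

```python
def boruvka_mst_weight(n, edges):
    # Each Borůvka round: every component selects its lightest
    # outgoing edge; selected edges join the MST and merge their
    # components. Repeat until one component remains. edges are
    # (weight, u, v) tuples.
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    total, components = 0.0, n
    while components > 1:
        best = {}  # component root -> lightest outgoing edge
        for w, u, v in edges:
            ru, rv = find(u), find(v)
            if ru != rv:
                for r in (ru, rv):
                    if r not in best or w < best[r][0]:
                        best[r] = (w, u, v)
        if not best:
            break  # graph is disconnected
        for w, u, v in best.values():
            ru, rv = find(u), find(v)
            if ru != rv:       # the same edge may win for both roots
                parent[ru] = rv
                total += w
                components -= 1
    return total
```

The round structure is what makes the algorithm GPU-friendly: each round is a data-parallel minimum-reduction per component followed by a contraction, both expressible with scan-family primitives.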

Exploiting Parallelism in Iterative Irregular Maxflow Computations on GPU Accelerators

This paper considers a graph-based maximum-flow algorithm that has applications in network optimization problems and shows that the performance of the GPU algorithm far exceeds that of a sequential CPU algorithm.

Scalable parallel minimum spanning forest computation

This paper proposes a novel, scalable, parallel MSF algorithm for undirected weighted graphs that leverages Prim's algorithm in a parallel fashion, concurrently expanding several subsets of the computed MSF.

Work-Efficient Parallel GPU Methods for Single-Source Shortest Paths

This work shows that in general the Near-Far method has the highest performance on modern GPUs, outperforming other parallel methods; it also explores a variety of parallel load-balanced graph traversal strategies and applies them to the SSSP solver.
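The Near-Far idea is a two-bucket relaxation of delta-stepping: only vertices whose tentative distance falls below a moving threshold (the "near" pile) are relaxed; heavier ones wait in the "far" pile until the threshold advances. A serial sketch under illustrative names, with `delta` as the method's key tuning knob:

```python
import math

def sssp_near_far(adj, src, delta):
    # adj: vertex -> list of (neighbor, edge_weight).
    # Relax only "near" vertices (dist < threshold); defer the rest
    # to the "far" pile. When near empties, advance the threshold by
    # delta and re-split the far pile.
    dist = {v: math.inf for v in adj}
    dist[src] = 0.0
    near, far = [src], []
    threshold = delta
    while near or far:
        while near:
            u = near.pop()
            if dist[u] >= threshold:
                far.append(u)        # re-bucket heavy/stale entries
                continue
            for v, w in adj[u]:
                nd = dist[u] + w
                if nd < dist[v]:
                    dist[v] = nd
                    (near if nd < threshold else far).append(v)
        near, far = far, []
        threshold += delta
    return dist
```

Compared with a strict priority queue this admits some redundant relaxations, but the near pile exposes large batches of independent work, which is what makes it fast on a GPU.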

Performance Characterization and Optimization of Atomic Operations on AMD GPUs

Using a novel software-based implementation of atomic operations on an AMD GPU can speed up an application kernel by 67-fold over the same application kernel with the (default) system-provided atomic operations.
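Software atomics are typically built as a compare-and-swap retry loop. The sketch below shows that pattern in Python; the CAS itself is emulated with a lock here (on a GPU it would be a hardware primitive), and all names are illustrative rather than taken from the paper.

```python
import threading

class Cell:
    # A fetch-and-add built from a compare-and-swap retry loop, the
    # classic pattern behind software-implemented atomics.
    def __init__(self, value=0):
        self._value = value
        self._lock = threading.Lock()

    def cas(self, expected, new):
        # Stand-in for a hardware compare-and-swap.
        with self._lock:
            if self._value == expected:
                self._value = new
                return True
            return False

    def fetch_add(self, delta):
        # Optimistic read, then CAS; retry on contention. No update
        # can be lost, because a failed CAS forces a re-read.
        while True:
            old = self._value
            if self.cas(old, old + delta):
                return old

counter = Cell()
threads = [threading.Thread(
               target=lambda: [counter.fetch_add(1) for _ in range(1000)])
           for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```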

Stochastic Gradient Descent with GPGPU

We show how to optimize a Support Vector Machine and a predictor for Collaborative Filtering with Stochastic Gradient Descent on the GPU, achieving 1.66- to 6-times acceleration compared to a…