Parallel breadth-first search on distributed memory systems

@inproceedings{Bulu2011ParallelBS,
  title={Parallel breadth-first search on distributed memory systems},
  author={A. Buluç and Kamesh Madduri},
  booktitle={2011 International Conference for High Performance Computing, Networking, Storage and Analysis (SC)},
  year={2011},
  pages={1--12}
}
  • A. Buluç, Kamesh Madduri
  • Published 2011
  • Computer Science
  • 2011 International Conference for High Performance Computing, Networking, Storage and Analysis (SC)
Data-intensive, graph-based computations are pervasive in several scientific applications, and are known to be quite challenging to implement on distributed memory systems. In this work, we explore the design space of parallel algorithms for Breadth-First Search (BFS), a key subroutine in several graph algorithms. We present two highly-tuned parallel approaches for BFS on large parallel systems: a level-synchronous strategy that relies on a simple vertex-based partitioning of the graph, and…
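The level-synchronous strategy with vertex-based partitioning mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' tuned implementation: the ranks are simulated inside one Python process rather than with MPI, and the names `owner` and `level_sync_bfs`, the modulo ownership rule, and the adjacency-list input are all assumptions made for illustration. Each level has two phases: every rank expands its local frontier and buckets the discovered neighbors by owning rank (standing in for an all-to-all exchange), then every rank marks the unvisited vertices it received.

```python
def owner(v, num_ranks):
    # Assumed 1D (vertex-based) partitioning: vertex v lives on rank v mod p.
    return v % num_ranks


def level_sync_bfs(adj, source, num_ranks):
    """Return the BFS level of every vertex (-1 if unreachable).

    adj is a list of neighbor lists; ranks are simulated sequentially.
    """
    level = [-1] * len(adj)
    level[source] = 0
    # One frontier queue per simulated rank, holding only owned vertices.
    frontiers = [[] for _ in range(num_ranks)]
    frontiers[owner(source, num_ranks)].append(source)
    depth = 0
    while any(frontiers):
        # Phase 1: expand local frontiers; outbox[dst] collects the
        # vertices that would be sent to rank dst in the exchange.
        outbox = [[] for _ in range(num_ranks)]
        for rank in range(num_ranks):
            for u in frontiers[rank]:
                for v in adj[u]:
                    outbox[owner(v, num_ranks)].append(v)
        # Phase 2: each rank filters what it received and builds
        # the next frontier from vertices seen for the first time.
        depth += 1
        frontiers = [[] for _ in range(num_ranks)]
        for rank in range(num_ranks):
            for v in outbox[rank]:
                if level[v] == -1:
                    level[v] = depth
                    frontiers[rank].append(v)
    return level
```

In a real distributed run the `outbox` lists would be exchanged between processes once per level, which is why the amount of data bucketed in phase 1 dominates the communication cost of this approach.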
Fast and scalable NUMA-based thread parallel breadth-first search
  • Yuichiro Yasui, K. Fujisawa
  • Computer Science
  • 2015 International Conference on High Performance Computing & Simulation (HPCS)
  • 2015
TLDR
This paper investigates the locality of memory accesses in terms of the communication with remote memories in a BFS for a non-uniform memory access (NUMA)-based system, and describes a fast and highly scalable implementation.
Reducing Communication in Parallel Breadth-First Search on Distributed Memory Systems
TLDR
This work proposes a novel distributed directory to sieve the redundant data in collective communications and uses a bitmap compression algorithm to further reduce the size of messages in communication in distributed BFS.
Task-based parallel breadth-first search in heterogeneous environments
TLDR
This study shows high processing rates are achievable with hybrid environments despite the GPU communication latency and memory coherence, and uses a fine-grained task-based parallelization scheme and the OmpSs programming model to achieve that goal.
Level-Synchronous Parallel Breadth-First Search Algorithms For Multicore and Multiprocessor Systems
TLDR
It is shown that for small core counts many of these algorithms show rather similar behaviour, but, for large core counts and large graphs, there are considerable differences in performance and scalability influenced by several factors.
Understanding parallelism in graph traversal on multi-core clusters
TLDR
A new hybrid MPI/Pthreads breadth-first search (BFS) algorithm featuring (i) overlapped computation and communication, separated into multiple threads, (ii) maximized multi-threading parallelism on multi-cores, using massive numbers of threads to improve throughput, and (iii) pipeline parallelism that uses lock-free queues for asynchronous communication.
NUMA-optimized parallel breadth-first search on multicore single-node system
TLDR
This paper describes a highly efficient BFS using column-wise partitioning of the adjacency list while carefully considering the non-uniform memory access (NUMA) architecture.
Scalable Triangle Counting on Distributed-Memory Systems
TLDR
This work proposes a novel, hybrid, parallel triangle counting algorithm based on its linear algebra formulation that achieves the fastest time on the 1.4B-edge real-world Twitter graph, which is 3.217 seconds, on 1,092 cores.
Efficient Breadth-First Search on Massively Parallel and Distributed-Memory Machines
TLDR
A new method for distributed parallel BFS can compute BFS on a graph with one trillion vertices within half a second, using large supercomputers such as the K-Computer.
Graph partitioning for scalable distributed graph computations
TLDR
This work uses breadth-first search as a representative example, and derives upper bounds on the communication costs incurred with a two-dimensional partitioning of the graph, and presents empirical results for communication costs with various graph partitioning strategies.
Optimizing Breadth-First Search at Scale Using Hardware-Accelerated Space Consistency
  • K. Ibrahim
  • Computer Science
  • 2019 IEEE 26th International Conference on High Performance Computing, Data, and Analytics (HiPC)
  • 2019
TLDR
This work presents both an efficient algorithmic approach to carry out the traversal and a low-overhead runtime that provides efficient primitives to implement the algorithm, and extends the model to leverage hardware accelerated collectives and provide primitives for one-sided broadcast and sparse reduction.

References

SHOWING 1-10 OF 51 REFERENCES
Designing Multithreaded Algorithms for Breadth-First Search and st-connectivity on the Cray MTA-2
TLDR
This paper presents fast parallel implementations of three fundamental graph theory problems, breadth-first search, st-connectivity and shortest paths for unweighted graphs, on multithreaded architectures such as the Cray MTA-2, and reports impressive results, both for algorithm execution time and parallel performance.
Multithreaded Asynchronous Graph Traversal for In-Memory and Semi-External Memory
  • R. Pearce, M. Gokhale, N. Amato
  • Computer Science
  • 2010 ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis
  • 2010
TLDR
This work presents a novel asynchronous approach to compute Breadth-First-Search (BFS), Single-Source-Shortest-Paths, and Connected Components for large graphs in shared memory to overcome data latencies and provide significant speedup over alternative approaches.
A Scalable Distributed Parallel Breadth-First Search Algorithm on BlueGene/L
TLDR
This paper presents a distributed breadth-first search (BFS) scheme that scales for random graphs with up to three billion vertices and 30 billion edges, and develops efficient collective communication functions for the 3D torus architecture of BlueGene/L that take advantage of the structure in the problem.
Topologically Adaptive Parallel Breadth-First Search on Multicore Processors
Breadth-first Search (BFS) is a fundamental graph theory algorithm that is extensively used to abstract various challenging computational problems. Due to the fine-grained irregular memory accesses, …
Lifting sequential graph algorithms for distributed-memory parallel computation
TLDR
This paper revisits the abstractions comprising the Boost Graph Library in the context of distributed-memory parallelism, lifting away the implicit requirements of sequential execution and a single shared address space and develops general principles and patterns for using (and reusing) generic, object-oriented parallel software libraries.
Scalable Graph Exploration on Multicore Processors
TLDR
This paper designs a breadth-first search algorithm for advanced multi-core processors that are likely to become the building blocks of future exascale systems, and presents an experimental study that uses state-of-the-art Intel Nehalem EP and EX processors and up to 64 threads in a single system.
A work-efficient parallel breadth-first search algorithm (or how to cope with the nondeterminism of reducers)
TLDR
A general method for analyzing nondeterministic programs that use reducers is presented, and it is shown that for a graph G=(V,E) with diameter D and bounded out-degree, this data-race-free version of the PBFS algorithm attains near-perfect linear speedup if P ≪ (V+E)/(D lg³(V/D)).
Design of a Large-Scale Hybrid-Parallel Graph Library
The focus of traditional scientific computing has been in solving large systems of PDEs (and the corresponding linear algebra problems that they induce). Hardware architectures, computer systems, and …
Efficient Breadth-First Search on the Cell/BE Processor
TLDR
The proposed methodology combines a high-level algorithmic design that captures the machine-independent aspects, to guarantee portability with performance to future processors, with an implementation that embeds processor-specific optimizations.
Fast PGAS Implementation of Distributed Graph Algorithms
TLDR
This work presents the first fast PGAS implementation of graph algorithms for the connected components and minimum spanning tree problems, and achieves significant speedups over both the best sequential implementation and the best single-node SMP implementation for large, sparse graphs with more than a billion edges.