Analysis and performance results of computing betweenness centrality on IBM Cyclops64

  • Guangming Tan, Vugranam C. Sreedhar, Guang Rong Gao
  • The Journal of Supercomputing
This paper presents a joint study of application and architecture to improve the performance and scalability of an irregular application—computing betweenness centrality—on a many-core architecture, IBM Cyclops64. The characteristics of unstructured parallelism, dynamically non-contiguous memory access, and low arithmetic intensity in betweenness centrality pose an obstacle to an efficient mapping of parallel algorithms onto such many-core architectures. By identifying several key architectural…

Understanding parallelism in graph traversal on multi-core clusters

A new hybrid MPI/Pthreads breadth-first search (BFS) algorithm featuring (i) overlapping computation and communication by separating them into multiple threads, (ii) maximizing multi-threading parallelism on multi-cores with massive threads to improve throughput, and (iii) exploiting pipeline parallelism using lock-free queues for asynchronous communication.

The Combinatorial BLAS: design, implementation, and applications

The parallel Combinatorial BLAS is described, which consists of a small but powerful set of linear algebra primitives specifically targeting graph and data mining applications, and an extensible library interface and some guiding principles for future development are provided.

Reducing Communication in Parallel Breadth-First Search on Distributed Memory Systems

This work proposes a novel distributed directory to sieve the redundant data in collective communications and uses a bitmap compression algorithm to further reduce the size of messages in communication in distributed BFS.
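As a rough illustration of the bitmap idea (a sketch, not the paper's actual implementation), a BFS frontier of vertex ids can be packed into a bitmap before transmission, shrinking messages from one word per vertex to one bit per vertex; the helper names below are hypothetical:

```python
def frontier_to_bitmap(frontier, n):
    """Pack a set of vertex ids in [0, n) into a bitmap of ceil(n/8) bytes."""
    bitmap = bytearray((n + 7) // 8)
    for v in frontier:
        bitmap[v >> 3] |= 1 << (v & 7)   # set bit v
    return bytes(bitmap)

def bitmap_to_frontier(bitmap):
    """Unpack a bitmap back into a sorted list of vertex ids."""
    return [i * 8 + b for i, byte in enumerate(bitmap)
            for b in range(8) if byte >> b & 1]
```

For a 64-vertex partition, any frontier fits in 8 bytes, versus 8 bytes per vertex for explicit 64-bit ids; dense frontiers in the middle BFS levels benefit the most.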

Compression and Sieve: Reducing Communication in Parallel Breadth First Search on Distributed Memory Systems

This paper substantially reduces the communication cost of distributed BFS by compressing and sieving messages, and proposes a novel distributed directory algorithm, cross directory, to sieve redundant data in messages.

Betweenness centrality: algorithms and implementations

A new asynchronous parallel algorithm for betweenness centrality is derived that works seamlessly for both weighted and unweighted graphs, can be applied to large graphs, and is able to extract large amounts of parallelism.
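For context, parallel betweenness-centrality algorithms such as this one build on Brandes' sequential algorithm (forward shortest-path counting, then reverse dependency accumulation). A minimal sketch for unweighted graphs, assuming an adjacency-dict input; the function name and representation are illustrative, not the paper's API:

```python
from collections import deque

def brandes_bc(adj):
    """Betweenness centrality of every vertex in an unweighted graph.

    adj: dict mapping vertex -> iterable of neighbors.
    Returns dict vertex -> raw (directed) BC accumulation.
    """
    bc = {v: 0.0 for v in adj}
    for s in adj:
        # Forward phase: BFS from s, counting shortest paths (sigma)
        # and recording shortest-path predecessors.
        sigma = {v: 0 for v in adj}; sigma[s] = 1
        dist = {v: -1 for v in adj}; dist[s] = 0
        pred = {v: [] for v in adj}
        order = []                      # vertices in non-decreasing distance
        q = deque([s])
        while q:
            v = q.popleft()
            order.append(v)
            for w in adj[v]:
                if dist[w] < 0:
                    dist[w] = dist[v] + 1
                    q.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    pred[w].append(v)
        # Backward phase: accumulate dependencies in reverse BFS order.
        delta = {v: 0.0 for v in adj}
        for w in reversed(order):
            for v in pred[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != s:
                bc[w] += delta[w]
    return bc
```

The two phases are what make parallelization hard: the forward BFS has level-by-level dependences and irregular neighbor accesses, and the backward accumulation must visit vertices in reverse distance order.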

Dynamic Merging of Frontiers for Accelerating the Evaluation of Betweenness Centrality

This work proposes a new algorithm, called dynamic merging of frontiers, which uses per-vertex shortest-path information to derive the BC scores of degree-2 vertices by reusing the results from the sub-trees of their neighbors.

Hierarchical Scheduling of DAG Structured Computations on Manycore Processors with Dynamic Thread Grouping

This work proposes a hierarchical scheduling method with dynamic thread grouping, which schedules DAG structured computations at three different levels, and exhibits superior performance when compared with other various baseline methods, including typical centralized and distributed schedulers.

Self-Adaptive Evidence Propagation on Manycore Processors

  • Yinglong Xia, V. Prasanna
  • Computer Science
    2011 IEEE International Symposium on Parallel and Distributed Processing Workshops and Phd Forum
  • 2011
This paper proposes a self-adaptive scheduler that dynamically adjusts the number of threads for scheduling or executing tasks according to the task dependency graph, and implements the method on the Sun UltraSPARC T2 (Niagara 2) platform.

A round-efficient distributed betweenness centrality algorithm

Min-Rounds BC (MRBC), a distributed-memory algorithm in the CONGEST model that computes the betweenness centrality (BC) of every vertex in a directed unweighted n-node graph in O(n) rounds, reduces the number of rounds by at least a constant factor over previous results.

Characterizing Betweenness Centrality Algorithm on Multi-core Architectures

  • Dengbiao Tu, Guangming Tan
  • Computer Science
    2009 IEEE International Symposium on Parallel and Distributed Processing with Applications
  • 2009
It is found that dynamically non-contiguous memory access, unstructured parallelism and low arithmetic intensity in BC program pose an obstacle to an efficient execution on parallel architectures.

Just-In-Time Locality and Percolation for Optimizing Irregular Applications on a Manycore Architecture

The proposed percolation model for Just-In-Time Locality moves data proactively close to the computation and organizes the data layout such that locality is exploited effectively.

Efficient emulation of hardware prefetchers via event-driven helper threading

  • I. Ganusov, Martin Burtscher
  • Computer Science
    2006 International Conference on Parallel Architectures and Compilation Techniques (PACT)
  • 2006
This paper proposes a lightweight architectural framework for efficient event-driven software emulation of complex hardware accelerators and describes how this framework can be applied to implement a variety of prefetching techniques.

Landing OpenMP on Cyclops-64: an efficient mapping of OpenMP to a many-core system-on-a-chip

This paper presents the experience of mapping the OpenMP parallel programming model to the IBM Cyclops-64 (C64) architecture, focusing on three areas: a memory-aware runtime library that places frequently used data structures in scratchpad memory, a unique spin-lock algorithm for shared-memory synchronization, and a fast barrier that directly uses C64 hardware support for collective synchronization.

A scalable approach to thread-level speculation

This paper proposes and evaluates a design for supporting TLS that seamlessly scales to any machine size because it is a straightforward extension of writeback invalidation-based cache coherence (which itself scales both up and down).

Synchronization state buffer: supporting efficient fine-grain synchronization on many-core architectures

The Synchronization State Buffer is proposed, a scalable architectural design for fine-grain synchronization that efficiently synchronizes concurrent threads by recording and managing the states of frequently synchronized data with modest hardware support.

Scaling performance of interior-point method on large-scale chip multiprocessor system

This paper proposes and evaluates several algorithmic and hardware features to improve IPM parallel performance on large-scale CMPs and demonstrates how exploring multiple levels of parallelism with hardware support for low overhead task queues and parallel reduction enables IPM to achieve up to 48X parallel speedup on a 64-core CMP.

Optimization of sparse matrix-vector multiplication on emerging multicore platforms

This work examines sparse matrix-vector multiply (SpMV) - one of the most heavily used kernels in scientific computing - across a broad spectrum of multicore designs, and presents several optimization strategies especially effective for the multicore environment.
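As background, the baseline kernel these optimizations target is SpMV over the compressed sparse row (CSR) layout; a minimal, unoptimized sketch (array names follow the common CSR convention, and are illustrative rather than the paper's code):

```python
def spmv_csr(indptr, indices, data, x):
    """Compute y = A @ x for a sparse matrix A stored in CSR form.

    indptr[row] .. indptr[row+1] delimits row's nonzeros;
    indices holds their column ids and data their values.
    """
    y = [0.0] * (len(indptr) - 1)
    for row in range(len(y)):
        acc = 0.0
        for k in range(indptr[row], indptr[row + 1]):
            acc += data[k] * x[indices[k]]   # indirect access into x
        y[row] = acc
    return y
```

The indirect load `x[indices[k]]` is the low-arithmetic-intensity, irregular access pattern that the multicore optimization strategies in the paper (blocking, reordering, software prefetch, and so on) are designed to mitigate.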

Programming models and system software for future high-end computing systems: work-in-progress

A suitable program execution model, a high-level programming notation which shields the application developer from the complexities of the architecture, and a compiler and runtime system based on the underlying models are developed.

Optimistic parallelism requires abstractions

It is shown that Delaunay mesh generation and agglomerative clustering can be parallelized in a straightforward way using the Galois approach, and results suggest that Galois is a practical approach to exploiting data parallelism in irregular programs.