Combinatorial BLAS 2.0: Scaling Combinatorial Algorithms on Distributed-Memory Systems

@article{Azad2022CombinatorialB2,
  title={Combinatorial BLAS 2.0: Scaling Combinatorial Algorithms on Distributed-Memory Systems},
  author={Ariful Azad and Oguz Selvitopi and Md Taufique Hussain and John R. Gilbert and Aydın Buluç},
  journal={IEEE Transactions on Parallel and Distributed Systems},
  year={2022},
  volume={33},
  pages={989--1001}
}
Combinatorial algorithms, such as those that arise in graph analysis, modeling of discrete systems, bioinformatics, and chemistry, are often hard to parallelize. The Combinatorial BLAS library implements key computational primitives for rapid development of combinatorial algorithms on distributed-memory systems. During the decade since its first introduction, the Combinatorial BLAS library has evolved and expanded significantly. This article details many of the key technical features of…

Parallel Algorithms for Adding a Collection of Sparse Matrices

TLDR
A series of algorithms for SpKAdd using tree-merging, heap, sparse-accumulator, hash-table, and sliding-hash-table data structures that attain the theoretical lower bounds on both the computational and I/O complexities and perform best in practice.
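As background for the hash-accumulator idea mentioned above, here is a minimal serial sketch (not the paper's implementation) of adding a collection of sparse matrices with a hash-table accumulator, assuming a hypothetical COO triple representation:

```python
from collections import defaultdict

def spkadd(matrices):
    """Sum a collection of sparse matrices, each given as a list of
    (row, col, val) triples, using a hash-table accumulator keyed on
    the (row, col) coordinate."""
    acc = defaultdict(float)
    for triples in matrices:
        for r, c, v in triples:
            acc[(r, c)] += v
    # Drop explicit zeros created by cancellation; return sorted triples.
    return sorted((r, c, v) for (r, c), v in acc.items() if v != 0.0)

A = [(0, 0, 1.0), (1, 2, 3.0)]
B = [(0, 0, 2.0), (1, 2, -3.0), (2, 1, 5.0)]
print(spkadd([A, B]))  # [(0, 0, 3.0), (2, 1, 5.0)]
```

Each input triple is touched exactly once, which is the intuition behind the work-optimality claim; the distributed and cache-aware variants in the paper are considerably more involved.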

Fast Dynamic Updates and Dynamic SpGEMM on MPI-Distributed Graphs

TLDR
This paper proposes a batch-dynamic algorithm for MPI-based parallel computing that reduces the communication volume of SpGEMM by exploiting the fact that updates change far fewer matrix entries than there are non-zeros in the input operands.

TileSpGEMM: a tiled algorithm for parallel sparse general matrix-matrix multiplication on GPUs

TLDR
This paper proposes a tiled parallel SpGEMM algorithm that sparsifies the tiled method used in dense general matrix-matrix multiplication, saves each non-empty tile in a sparse form, and outperforms four state-of-the-art SpGEMM methods.

Distributed-Memory Parallel Contig Generation for De Novo Long-Read Genome Assembly

TLDR
This work presents a novel distributed memory algorithm that, from a string graph representation of the genome and using sparse matrices, generates the contig set, i.e., overlapping sequences that form a map representing a region of a chromosome.

Distributed-Memory Sparse Kernels for Machine Learning

Sampled Dense Times Dense Matrix Multiplication (SDDMM) and Sparse Times Dense Matrix Multiplication (SpMM) appear in diverse settings, such as collaborative filtering, document clustering, and graph…
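To make the two kernels concrete, a minimal pure-Python sketch of both (dense matrices as lists of rows, sparse matrices as hypothetical (row, col, val) triples; not the paper's distributed implementation):

```python
def spmm(S, D, m):
    """SpMM: Y = S * D, where S is a sparse m-by-k matrix given as
    (row, col, val) triples and D is a dense k-by-n matrix."""
    n = len(D[0])
    Y = [[0.0] * n for _ in range(m)]
    for r, c, v in S:
        for j in range(n):
            Y[r][j] += v * D[c][j]
    return Y

def sddmm(S, A, B):
    """SDDMM: sample the dense product A * B^T only at S's nonzero
    positions, scaling each sampled dot product by S's value there."""
    return [(r, c, v * sum(a * b for a, b in zip(A[r], B[c])))
            for r, c, v in S]

S = [(0, 1, 2.0)]
D = [[1.0, 1.0], [3.0, 4.0]]
print(spmm(S, D, 2))  # [[6.0, 8.0], [0.0, 0.0]]
print(sddmm(S, [[1.0, 2.0], [0.0, 1.0]], [[3.0, 4.0], [5.0, 6.0]]))
# [(0, 1, 34.0)]
```

The key structural difference is that SpMM produces a dense output driven by the sparse operand's rows, while SDDMM produces a sparse output whose pattern matches the sampling matrix.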

GraphBLAS on the Edge: High Performance Streaming of Network Traffic

TLDR
The performance of GraphBLAS on an Accolade Technologies edge network device is demonstrated in a near worst-case traffic scenario using a continuous stream of CAIDA Telescope darknet packets, showing that anonymized hypersparse traffic matrices are readily computable on edge network devices with minimal compute resources and can be a viable data product for such devices.

GraphBLAS on the Edge: Anonymized High Performance Streaming of Network Traffic

TLDR
The performance of GraphBLAS on an Accolade Technologies edge network device is demonstrated in a near worst-case traffic scenario using a continuous stream of CAIDA Telescope darknet packets, showing that anonymized hypersparse traffic matrices are readily computable on edge network devices with minimal compute resources and can be a viable data product for such devices.

References

Showing 1-10 of 47 references

The Combinatorial BLAS: design, implementation, and applications

TLDR
The parallel Combinatorial BLAS is described, which consists of a small but powerful set of linear algebra primitives specifically targeting graph and data mining applications, and an extensible library interface and some guiding principles for future development are provided.

Distributed-memory parallel algorithms for sparse times tall-skinny-dense matrix multiplication

TLDR
The evaluations reveal that with the involvement of GPU accelerators, the best design choices for SpMM differ from the conventional algorithms that are known to perform well for dense matrix-matrix or sparse matrix-sparse matrix multiplies.

The Reverse Cuthill-McKee Algorithm in Distributed-Memory

TLDR
This paper presents the first-ever distributed-memory implementation of the reverse Cuthill-McKee (RCM) algorithm for reducing the profile of a sparse matrix and achieves high performance by decomposing the problem into a small number of primitives and utilizing optimized implementations of these primitives.

Optimizing High Performance Markov Clustering for Pre-Exascale Architectures

TLDR
This work systematically removes scalability and performance bottlenecks of HipMCL, enables GPUs by performing the expensive expansion phase of the MCL algorithm on the GPU, proposes a CPU-GPU joint distributed SpGEMM algorithm called pipelined Sparse SUMMA, and integrates a probabilistic memory requirement estimator that is fast and accurate.

Communication-Avoiding and Memory-Constrained Sparse Matrix-Matrix Multiplication at Extreme Scale

TLDR
This work developed a distributed symbolic step to understand the memory requirement and determine the number of batches beforehand, and integrated the multiplication in each batch with existing communication-avoiding techniques to reduce the communication overhead while multiplying matrices in a 3D process grid.

LACC: A Linear-Algebraic Algorithm for Finding Connected Components in Distributed Memory

  • A. Azad, A. Buluç
  • Computer Science
    2019 IEEE International Parallel and Distributed Processing Symposium (IPDPS)
  • 2019
TLDR
This paper presents a parallel connected-components algorithm that runs on distributed-memory computers using linear-algebraic primitives; it is based on a PRAM algorithm by Awerbuch and Shiloach and outperforms previous algorithms by a significant margin.
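The hooking and shortcutting steps underlying the Awerbuch-Shiloach approach can be illustrated with a simple serial sketch (this is only the spirit of the PRAM algorithm, not LACC's distributed linear-algebraic formulation):

```python
def connected_components(n, edges):
    """Connected components via hooking (attach the larger root to the
    smaller one along each edge) and shortcutting (pointer jumping to
    flatten the parent trees), iterated until no parent changes."""
    parent = list(range(n))
    changed = True
    while changed:
        changed = False
        # Hooking step.
        for u, v in edges:
            ru, rv = parent[u], parent[v]
            if parent[ru] != parent[rv]:
                hi = max(parent[ru], parent[rv])
                lo = min(parent[ru], parent[rv])
                parent[hi] = lo
                changed = True
        # Shortcutting step: compress paths to the roots.
        for i in range(n):
            while parent[i] != parent[parent[i]]:
                parent[i] = parent[parent[i]]
    return parent

print(connected_components(5, [(0, 1), (1, 2), (3, 4)]))  # [0, 0, 0, 3, 3]
```

In the linear-algebraic setting, the hooking step becomes a sparse matrix-vector operation over a suitable semiring, which is what lets LACC reuse distributed SpMV infrastructure.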

An Efficient GPU General Sparse Matrix-Matrix Multiplication for Irregular Data

  • Weifeng Liu, B. Vinter
  • Computer Science
    2014 IEEE 28th International Parallel and Distributed Processing Symposium
  • 2014
TLDR
This work presents a GPU SpGEMM algorithm that particularly focuses on load balancing, memory pre-allocation for the result matrix, and parallel insert operations for the nonzero entries, using what is experimentally found to be the fastest GPU merge approach.

Distributed-Memory Algorithms for Maximum Cardinality Matching in Bipartite Graphs

  • A. Azad, A. Buluç
  • Computer Science
    2016 IEEE International Parallel and Distributed Processing Symposium (IPDPS)
  • 2016
TLDR
This work designs and implements scalable distributed-memory algorithms for maximum cardinality matching in bipartite graphs, employing bulk-synchronous matrix-algebraic modules to implement graph searches and Remote Memory Access (RMA) operations to implement asynchronous lightweight graph accesses.

On the representation and multiplication of hypersparse matrices

  • A. Buluç, J. Gilbert
  • Computer Science
    2008 IEEE International Symposium on Parallel and Distributed Processing
  • 2008
TLDR
This paper develops and analyzes two new algorithms for sparse matrix-matrix multiplication (SpGEMM) that scale significantly better than existing kernels, and considers these algorithms first as the sequential kernel of a scalable parallel sparse matrix multiplication algorithm and second as part of a polyalgorithm that would execute different kernels depending on the sparsity of the input matrices.
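This paper introduced the doubly compressed sparse column (DCSC) format, which stores only the non-empty columns so that storage is O(nnz) rather than O(n + nnz). A minimal serial sketch of building DCSC arrays from COO triples (variable names are illustrative, following the paper's jc/cp/ir/num convention):

```python
from collections import defaultdict

def to_dcsc(triples):
    """Build a DCSC-like structure: jc lists the indices of non-empty
    columns, cp holds pointers into ir/num for each listed column, ir
    holds row indices, and num holds the values. Storage depends only
    on nnz, not on the (possibly enormous) matrix dimension."""
    cols = defaultdict(list)
    for r, c, v in triples:
        cols[c].append((r, v))
    jc, cp, ir, num = sorted(cols), [0], [], []
    for c in jc:
        for r, v in sorted(cols[c]):
            ir.append(r)
            num.append(v)
        cp.append(len(ir))
    return jc, cp, ir, num

# A hypersparse 1,000,000 x 1,000,000 matrix with only 3 nonzeros:
jc, cp, ir, num = to_dcsc([(5, 7, 1.0), (9, 7, 2.0), (3, 999_999, 4.0)])
print(jc, cp)  # [7, 999999] [0, 2, 3]
```

A plain CSC representation of the same matrix would need a million-entry column-pointer array; DCSC's extra indirection through jc is what makes hypersparse submatrices affordable in 2D distributed SpGEMM.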