Semi-Asymmetric Parallel Graph Algorithms for NVRAMs
@article{Dhulipala2019SemiAsymmetricPG,
  title   = {Semi-Asymmetric Parallel Graph Algorithms for NVRAMs},
  author  = {Laxman Dhulipala and Charles McGuffey and Hong Kyu Kang and Yan Gu and Guy E. Blelloch and Phillip B. Gibbons and Julian Shun},
  journal = {ArXiv},
  year    = {2019},
  volume  = {abs/1910.12310}
}
Emerging non-volatile main memory (NVRAM) technologies provide novel features for large-scale graph analytics, combining byte-addressability, low idle power, and improved memory density. Systems are likely to have an order of magnitude more NVRAM than traditional memory (DRAM), allowing large graph problems to be solved efficiently at a modest cost on a single machine. However, a significant challenge in achieving high performance is accounting for the fact that NVRAM writes can be…
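The semi-asymmetric idea summarized in the abstract — keep the large graph read-only (as if resident in NVRAM) and confine writes to smaller DRAM-resident structures — can be illustrated with a minimal sketch. This is an illustrative pattern in plain Python, not the paper's implementation; the NVRAM/DRAM split is only simulated by which data structures are mutated:

```python
# Illustrative sketch of the semi-asymmetric pattern: the adjacency
# list is never mutated (as if stored in NVRAM), while all writes go
# to arrays proportional to the number of vertices (as if in DRAM).
def bfs_semi_asymmetric(adj, source):
    """BFS that only reads `adj`; writes are confined to O(n) state."""
    n = len(adj)
    parent = [-1] * n          # DRAM-resident, O(n) writes total
    parent[source] = source
    frontier = [source]
    while frontier:
        next_frontier = []
        for u in frontier:
            for v in adj[u]:   # reads from the read-only graph
                if parent[v] == -1:
                    parent[v] = u
                    next_frontier.append(v)
        frontier = next_frontier
    return parent

# usage on a 4-cycle: 0-1-2-3-0
adj = [[1, 3], [0, 2], [1, 3], [0, 2]]
print(bfs_semi_asymmetric(adj, 0))  # → [0, 0, 1, 0]
```

Keeping all mutation in vertex-proportional arrays is what lets a write-limited design sidestep the higher cost of NVRAM writes that the abstract alludes to.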
11 Citations
NVRAM as an Enabler to New Horizons in Graph Processing
- Computer ScienceSN Computer Science
- 2022
It is found that NVRAM enables the processing of exceptionally large graphs on a single node with good performance, price, and power consumption, and, for the first time, the ability to process a graph of 750 billion edges while staying within the memory of a single node is demonstrated.
Optimal Parallel Algorithms in the Binary-Forking Model
- Computer ScienceSPAA
- 2020
This paper explores techniques for designing optimal algorithms when limited to binary forking and assuming asynchrony, and develops the first algorithms with optimal work and span in the binary-forking model.
Analysis of Work-Stealing and Parallel Cache Complexity
- Computer ScienceAPOCS
- 2022
A simplified, classroom-ready analysis of the randomized work-stealing (RWS) scheduler is presented, which decouples the span from the analysis of the parallel cache complexity, and new parallel cache bounds are shown for a list of classic algorithms.
Optimal (Randomized) Parallel Algorithms in the Binary-Forking Model
- Computer Science
- 2020
All algorithms described in this paper are the first algorithms with optimal work and span in the binary-forking model, and most of the algorithms are simple.
Many Sequential Iterative Algorithms Can Be Parallel and (Nearly) Work-efficient
- Computer ScienceSPAA
- 2022
This paper presents work-efficient and round-efficient algorithms for a variety of classic problems and proposes general approaches to do so, and uses two types of general techniques to enable work-efficiency and high parallelism.
Many Sequential Iterative Algorithms Can Be Parallel and Work-efficient
- Computer Science
- 2022
This paper presents work-efficient and round-efficient algorithms for a variety of classic problems and proposes general approaches to do so and uses two types of general techniques to enable work-efficiency and high parallelism.
Parallel Cover Trees and their Applications
- Computer Science
- 2022
Using the authors' parallel cover trees, work-efficient (or near-work-efficient) and highly parallel solutions are shown for a list of problems in computational geometry and machine learning, including Euclidean minimum spanning tree, single-linkage clustering, bichromatic closest pair, density-based clustering and its hierarchical version, and others.
Efficient Stepping Algorithms and Implementations for Parallel Shortest Paths
- Computer ScienceSPAA
- 2021
This work proposes the stepping algorithm framework and a new abstract data type, the lazy-batched priority queue (LaB-PQ), which abstracts the semantics of the priority queue needed by the stepping algorithms. Three algorithms are implemented, including ρ-stepping, which is fast in practice, and improved bounds are given for existing algorithms such as Radius-Stepping.
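The bucket structure that stepping algorithms generalize can be sketched sequentially. The following is the classic Δ-stepping bucket scheme, not the paper's ρ-stepping or LaB-PQ; relaxations are applied eagerly, a simplification of the original light/heavy edge split, and `delta` is the tunable bucket width:

```python
import math
from collections import defaultdict

# Sequential sketch of the classic Delta-stepping bucket structure
# (a simplified baseline, not the paper's rho-stepping or LaB-PQ).
def delta_stepping(adj, source, delta=1.0):
    """adj maps u -> [(v, weight), ...]; returns tentative distances."""
    dist = defaultdict(lambda: math.inf)
    dist[source] = 0.0
    buckets = defaultdict(set)
    buckets[0].add(source)
    while any(buckets.values()):
        i = min(k for k, s in buckets.items() if s)  # smallest non-empty bucket
        frontier = buckets.pop(i)
        while frontier:
            u = frontier.pop()
            for v, w in adj.get(u, []):
                nd = dist[u] + w
                if nd < dist[v]:
                    old, dist[v] = dist[v], nd
                    if old < math.inf:               # drop the stale bucket entry
                        buckets[int(old // delta)].discard(v)
                    j = int(nd // delta)
                    if j == i:
                        frontier.add(v)              # settle within this step
                    else:
                        buckets[j].add(v)
    return dict(dist)
```

With `delta` → 0 this degenerates toward Dijkstra's algorithm; larger values of `delta` expose more parallelism per step at the cost of redundant relaxations, which is the trade-off stepping frameworks parameterize.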
Parallel Cover Trees and their Applications
- Computer ScienceSPAA
- 2022
This paper shows highly parallel and work-efficient cover tree algorithms that can handle batch insertions (and thus construction) and batch deletions. Three key ideas guarantee work-efficiency: a prefix-doubling scheme, a careful design that limits the size of the graph on which MIS is applied, and a strategy to propagate information among different levels of the cover tree.
A Work-Efficient Parallel Algorithm for Longest Increasing Subsequence
- Computer ScienceArXiv
- 2022
This paper proposes a parallel LIS algorithm that costs O(n log k) work, Õ(k) span, and O(n) space, where k is the length of the LIS, and is much simpler than previous parallel LIS algorithms.
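For context, the sequential baseline these parallel bounds are measured against is the standard tails-array LIS, which runs in O(n log k) time since the array never grows beyond the LIS length k. A sketch (this is the classic sequential algorithm, not the paper's parallel one):

```python
import bisect

# Classic sequential O(n log k) LIS via a tails array, shown for
# contrast with the parallel bounds above; k = length of the longest
# strictly increasing subsequence.
def lis_length(a):
    tails = []  # tails[i] = smallest tail of an increasing subsequence of length i+1
    for x in a:
        i = bisect.bisect_left(tails, x)  # bisect_left enforces strict increase
        if i == len(tails):
            tails.append(x)
        else:
            tails[i] = x
    return len(tails)

print(lis_length([3, 1, 4, 1, 5, 9, 2, 6]))  # → 4  (e.g. 1, 4, 5, 9)
```

The difficulty the parallel algorithm addresses is that each step of this loop depends on the tails array produced by all previous steps, making the sequential version inherently order-dependent.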
References
SHOWING 1-10 OF 88 REFERENCES
Multithreaded Asynchronous Graph Traversal for In-Memory and Semi-External Memory
- Computer Science2010 ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis
- 2010
This work presents a novel asynchronous approach to compute Breadth-First-Search (BFS), Single-Source-Shortest-Paths, and Connected Components for large graphs in shared memory to overcome data latencies and provide significant speedup over alternative approaches.
Efficient Subgraph Matching on Non-volatile Memory
- Computer ScienceWISE
- 2017
This paper investigates efficient algorithms for subgraph matching, a fundamental problem in graph databases, on NVM, proposes a write-limited subgraph matching algorithm based on the analysis, and extends it to answer subgraph matching on dynamic graphs.
Integer Compression in NVRAM-centric Data Stores: Comparative Experimental Analysis to DRAM
- Computer ScienceDaMoN
- 2019
A detailed evaluation of state-of-the-art lightweight integer compression schemes and database operations on NVRAM is provided and compared with DRAM, and a combined approach in which both volatile and non-volatile memories are used in a cooperative fashion is investigated.
GraphMP: An Efficient Semi-External-Memory Big Graph Processing System on a Single Machine
- Computer Science2017 IEEE 23rd International Conference on Parallel and Distributed Systems (ICPADS)
- 2017
This paper proposes GraphMP, which adopts a vertex-centric sliding-window computation model to avoid reading and writing vertices on disk and uses a compressed edge cache mechanism to fully utilize the available memory of a machine and reduce the amount of disk accesses for edges.
Write-Optimized and Consistent RDMA-based NVM Systems
- Computer ScienceArXiv
- 2019
Erda is a zero-copy, log-structured memory design for Efficient Remote Data Atomicity that reduces NVM writes by approximately 50% while significantly improving throughput and decreasing latency.
FlashGraph: Processing Billion-Node Graphs on an Array of Commodity SSDs
- Computer ScienceFAST
- 2015
This work demonstrates that a multicore server can process graphs with billions of vertices and hundreds of billions of edges, utilizing commodity SSDs with minimal performance loss by implementing a graph-processing engine on top of a user-space SSD file system designed for high IOPS and extreme parallelism.
Gemini: A Computation-Centric Distributed Graph Processing System
- Computer ScienceOSDI
- 2016
Gemini is presented, a distributed graph processing system that applies multiple optimizations targeting computation performance to build scalability on top of efficiency and significantly outperforms all well-known existing distributed graph processing systems.
Single machine graph analytics on massive datasets using Intel optane DC persistent memory
- Computer ScienceProc. VLDB Endow.
- 2020
This paper evaluates four existing shared-memory graph frameworks and one out-of-core graph framework on large real-world graphs using a machine with 6TB of Optane PMM, and shows that frameworks following the advocated runtime and algorithmic principles perform significantly better than the others and are competitive with graph analytics frameworks running on production clusters.
Sorting with Asymmetric Read and Write Costs
- Computer ScienceSPAA
- 2015
This paper considers the PRAM model with asymmetric write cost, and presents write-efficient, cache-oblivious parallel algorithms for sorting, FFTs, and matrix multiplication, which yield provably good bounds for parallel machines with private caches or with a shared cache.
Parallel Algorithms for Asymmetric Read-Write Costs
- Computer ScienceSPAA
- 2016
A nested-parallel model of computation is presented that combines small per-task stack-allocated memories with symmetric read-write costs and an unbounded heap-allocated shared memory with asymmetric read-write costs, and it is shown how costs in the model map efficiently onto a more concrete machine model under a work-stealing scheduler.