• Corpus ID: 204904742

Semi-Asymmetric Parallel Graph Algorithms for NVRAMs

@article{Dhulipala2019SemiAsymmetricPG,
  title={Semi-Asymmetric Parallel Graph Algorithms for NVRAMs},
  author={Laxman Dhulipala and Charles McGuffey and Hong Kyu Kang and Yan Gu and Guy E. Blelloch and Phillip B. Gibbons and Julian Shun},
  journal={ArXiv},
  year={2019},
  volume={abs/1910.12310}
}
Emerging non-volatile main memory (NVRAM) technologies provide novel features for large-scale graph analytics, combining byte-addressability, low idle power, and improved memory density. Systems are likely to have an order of magnitude more NVRAM than traditional memory (DRAM), allowing large graph problems to be solved efficiently at a modest cost on a single machine. However, a significant challenge in achieving high performance is accounting for the fact that NVRAM writes can be… 
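The paper's semi-asymmetric approach keeps the large, read-only graph in NVRAM while confining all writes to a proportionally small amount of DRAM. A minimal sketch of that layout, assuming a CSR-format graph and using plain Python lists as stand-ins for the two memories (illustration only, not the paper's implementation):

```python
# Sketch: semi-asymmetric BFS layout. The CSR graph (offsets, edges) is
# treated as read-only "NVRAM"; all mutable per-vertex state (parents,
# frontier) lives in "DRAM". The traversal never writes to the graph arrays.
from collections import deque

def bfs_parents(offsets, edges, source):
    """BFS over a read-only CSR graph; writes only to DRAM-resident state."""
    n = len(offsets) - 1
    parents = [-1] * n          # DRAM: O(n) mutable state
    parents[source] = source
    frontier = deque([source])  # DRAM
    while frontier:
        u = frontier.popleft()
        for v in edges[offsets[u]:offsets[u + 1]]:  # NVRAM reads only
            if parents[v] == -1:
                parents[v] = u
                frontier.append(v)
    return parents
```

The point of the layout is that the mutable state is O(n) (vertices) while the read-only edges, typically the dominant O(m) term, incur no NVRAM writes at all.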

NVRAM as an Enabler to New Horizons in Graph Processing

It is found that NVRAM enables the processing of exceptionally large graphs on a single node with good performance, price, and power consumption, and demonstrates, for the first time, the ability to process a graph of 750 billion edges while staying within the memory of a single node.

Optimal Parallel Algorithms in the Binary-Forking Model

This paper explores techniques for designing optimal algorithms when limited to binary forking and assuming asynchrony, and develops the first algorithms with optimal work and span in the binary-forking model.

Analysis of Work-Stealing and Parallel Cache Complexity

Presents a simplified, classroom-ready analysis of the randomized work-stealing (RWS) scheduler that decouples the span from the analysis of parallel cache complexity, and shows new parallel cache bounds for a list of classic algorithms.

Optimal (Randomized) Parallel Algorithms in the Binary-Forking Model

All algorithms described in this paper are the first algorithms with optimal work and span in the binary-forking model, and most of the algorithms are simple.

Many Sequential Iterative Algorithms Can Be Parallel and (Nearly) Work-efficient

This paper presents work-efficient and round-efficient algorithms for a variety of classic problems and proposes general approaches to do so, and uses two types of general techniques to enable work-efficiency and high parallelism.

Parallel Cover Trees and their Applications

Using the authors' parallel cover trees, the paper gives work-efficient (or near-work-efficient), highly parallel solutions for a list of problems in computational geometry and machine learning, including Euclidean minimum spanning tree, single-linkage clustering, bichromatic closest pair, and density-based clustering and its hierarchical version.

Efficient Stepping Algorithms and Implementations for Parallel Shortest Paths

This work proposes the stepping algorithm framework and a new abstract data type, the lazy-batched priority queue (LaB-PQ), which abstracts the priority-queue semantics needed by stepping algorithms; it implements three algorithms, including ρ-stepping, which is fast in practice, and gives improved bounds for existing algorithms such as Radius-Stepping.
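Stepping algorithms such as Δ-stepping, Radius-Stepping, and ρ-stepping all relax shortest-path distances in bucketed batches rather than one vertex at a time. As an illustration of the shared idea (a sequential sketch, not the paper's parallel implementation), a minimal Δ-stepping, where `delta` is the assumed bucket width:

```python
# Sketch of sequential Delta-stepping SSSP: vertices are grouped into
# buckets of width `delta` by tentative distance, and each bucket is
# relaxed as a batch before moving on to the next bucket.
import math

def delta_stepping(graph, source, delta=1.0):
    """graph[u] is a list of (v, weight) pairs with weight >= 0."""
    dist = {v: math.inf for v in graph}
    dist[source] = 0.0
    buckets = {0: {source}}
    while buckets:
        i = min(buckets)
        while buckets[i]:                     # bucket may refill via short edges
            u = buckets[i].pop()
            for v, w in graph[u]:
                nd = dist[u] + w
                if nd < dist[v]:
                    if math.isfinite(dist[v]):
                        old = int(dist[v] // delta)
                        if old in buckets:    # move v out of its old bucket
                            buckets[old].discard(v)
                    dist[v] = nd
                    buckets.setdefault(int(nd // delta), set()).add(v)
        del buckets[i]
    return dist
```

A large `delta` degenerates toward Bellman-Ford (much reprocessing, few rounds); a tiny `delta` degenerates toward Dijkstra (one vertex per bucket) — the frameworks above tune or replace this trade-off.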

Parallel Cover Trees and their Applications

This paper shows highly parallel and work-efficient cover tree algorithms that can handle batch insertions (and thus construction) and batch deletions, and uses three key ideas to guarantee work-efficiency: a prefix-doubling scheme, a careful design to limit the size of the graph on which maximal independent set (MIS) is applied, and a strategy to propagate information among different levels of the cover tree.

A Work-Efficient Parallel Algorithm for Longest Increasing Subsequence

This paper proposes a parallel LIS algorithm that costs O(n log k) work, Õ(k) span, and O(n) space, where k is the LIS length, and is much simpler than previous parallel LIS algorithms.
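The sequential analogue of the O(n log k) work bound is the classic patience-sorting LIS, which maintains the smallest possible tail of an increasing subsequence of each length. A minimal sketch (sequential only, not the parallel algorithm from the paper):

```python
# Patience-sorting LIS length: tails[j] holds the smallest possible tail
# of an increasing subsequence of length j+1; tails stays sorted, so each
# element is placed with one binary search.
from bisect import bisect_left

def lis_length(seq):
    tails = []
    for x in seq:
        j = bisect_left(tails, x)   # first tail >= x
        if j == len(tails):
            tails.append(x)         # x extends the current longest LIS
        else:
            tails[j] = x            # x tightens the tail of length j+1
    return len(tails)
```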

References

SHOWING 1-10 OF 88 REFERENCES

Multithreaded Asynchronous Graph Traversal for In-Memory and Semi-External Memory

  • R. Pearce, M. Gokhale, N. Amato
  • Computer Science
    2010 ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis
  • 2010
This work presents a novel asynchronous approach to compute Breadth-First-Search (BFS), Single-Source-Shortest-Paths, and Connected Components for large graphs in shared memory to overcome data latencies and provide significant speedup over alternative approaches.

Efficient Subgraph Matching on Non-volatile Memory

This paper investigates efficient algorithms for subgraph matching, a fundamental problem in graph databases, on NVM, and proposes a write-limited subgraph matching algorithm based on this analysis, which is extended to answer subgraph matching on dynamic graphs.

Integer Compression in NVRAM-centric Data Stores: Comparative Experimental Analysis to DRAM

Provides a detailed evaluation of state-of-the-art lightweight integer compression schemes and database operations on NVRAM compared with DRAM, and investigates a combined approach where both volatile and non-volatile memories are used in a cooperative fashion.

GraphMP: An Efficient Semi-External-Memory Big Graph Processing System on a Single Machine

This paper proposes GraphMP, which adopts a vertex-centric sliding-window computation model to avoid reading and writing vertices on disk, and uses a compressed edge cache mechanism to fully utilize the available memory of a machine and reduce the amount of disk accesses for edges.

Write-Optimized and Consistent RDMA-based NVM Systems

Erda is a zero-copy, log-structured memory design for Efficient Remote Data Atomicity; it reduces NVM writes by approximately 50%, significantly improves throughput, and decreases latency.

FlashGraph: Processing Billion-Node Graphs on an Array of Commodity SSDs

This work demonstrates that a multicore server can process graphs with billions of vertices and hundreds of billions of edges, utilizing commodity SSDs with minimal performance loss by implementing a graph-processing engine on top of a user-space SSD file system designed for high IOPS and extreme parallelism.

Gemini: A Computation-Centric Distributed Graph Processing System

Gemini is presented, a distributed graph processing system that applies multiple optimizations targeting computation performance to build scalability on top of efficiency, and significantly outperforms all well-known existing distributed graph processing systems.

Single machine graph analytics on massive datasets using Intel optane DC persistent memory

This paper evaluates four existing shared-memory graph frameworks and one out-of-core graph framework on large real-world graphs using a machine with 6 TB of Optane PMM, and shows that frameworks using the advocated runtime and algorithmic principles perform significantly better than the others and are competitive with graph analytics frameworks running on production clusters.

Sorting with Asymmetric Read and Write Costs

This paper considers the PRAM model with asymmetric write cost, and presents write-efficient, cache-oblivious parallel algorithms for sorting, FFTs, and matrix multiplication, which yield provably good bounds for parallel machines with private caches or with a shared cache.

Parallel Algorithms for Asymmetric Read-Write Costs

Presents a nested-parallel model of computation that combines small per-task stack-allocated memories with symmetric read-write costs and an unbounded heap-allocated shared memory with asymmetric read-write costs, and shows how the costs in the model map efficiently onto a more concrete machine model under a work-stealing scheduler.
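These asymmetric-cost models charge more per write to the asymmetric memory than per read. A toy accounting sketch, assuming a write/read cost ratio OMEGA = 10 purely for illustration, that makes the charged cost of a routine visible:

```python
# Toy asymmetric-cost accounting: reads from the asymmetric memory cost 1,
# writes cost OMEGA (an assumed asymmetry factor). Wrapping an array lets
# us compare the charged cost of write-heavy vs write-light routines.
OMEGA = 10  # assumed write/read cost ratio, for illustration only

class CountedArray:
    def __init__(self, data):
        self._data = list(data)
        self.reads = 0
        self.writes = 0
    def __getitem__(self, i):
        self.reads += 1
        return self._data[i]
    def __setitem__(self, i, v):
        self.writes += 1
        self._data[i] = v
    def __len__(self):
        return len(self._data)
    def cost(self):
        return self.reads + OMEGA * self.writes

def increment_all(a):
    for i in range(len(a)):
        a[i] = a[i] + 1   # one read + one write per element
```

Under this accounting, `increment_all` on n elements is charged n reads plus OMEGA * n writes, which is why write-efficient algorithms trade extra reads (or DRAM scratch space) for fewer asymmetric-memory writes.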
...