Graph Sampling with Distributed In-Memory Dataflow Systems

@article{Gmez2021GraphSW,
  title={Graph Sampling with Distributed In-Memory Dataflow Systems},
  author={Kevin G{\'o}mez and Matthias T{\"a}schner and Mohammadreza Rostami and Christopher Rost and Erhard Rahm},
  journal={ArXiv},
  year={2021},
  volume={abs/1910.04493}
}
Given a large graph, a graph sample determines a subgraph with similar characteristics for certain metrics of the original graph. The samples are much smaller, thereby accelerating and simplifying the analysis and visualization of large graphs. We focus on the implementation of distributed graph sampling for Big Data frameworks and in-memory dataflow systems such as Apache Spark or Apache Flink. We evaluate the scalability of the new implementations and analyze to what degree the sampling…

Distributed temporal graph analytics with GRADOOP
TLDR
This paper presents the system architecture of Gradoop, its data model TPGM with composable temporal graph operators (such as snapshot, difference, pattern matching, and graph grouping), and several implementation details, and it evaluates the performance and scalability of selected operators.

References

Declarative and distributed graph analytics with GRADOOP
We demonstrate Gradoop, an open source framework that combines and extends features of graph database systems with the benefits of distributed graph processing. Using a rich graph data model
Analyzing extended property graphs with Apache Flink
TLDR
The Extended Property Graph Model is proposed, which is semantically rich, schema-free and supports multiple distinct graphs and provides declarative and combinable operators to analyze both single graphs and graph collections.
BIGGR: Bringing Gradoop to Applications
TLDR
The BIGGR approach is introduced, providing a novel tool for the user-friendly and efficient analysis and visualization of Big Graph Data on top of the open-source software KNIME and Gradoop and the distributed processing framework Apache Flink.
Analyzing Temporal Graphs with Gradoop
TLDR
This work extends the distributed graph analysis framework Gradoop for temporal graph analysis by adding time properties to vertices, edges and graphs and using them within graph operators, and outlines their use within analysis workflows.
A Visual Evaluation Study of Graph Sampling Techniques
TLDR
The study provides a practical guideline for visualizing big graphs of different sizes and structures and uses eight benchmark datasets with four different graphs collected from Stanford Network Analysis Platform and NetworkX to give a comprehensive comparison of various types of graphs.
Pregel: a system for large-scale graph processing
TLDR
A model for processing large graphs that has been designed for efficient, scalable and fault-tolerant implementation on clusters of thousands of commodity computers, and its implied synchronicity makes reasoning about programs easier.
Understanding Graph Sampling Algorithms for Social Network Analysis
  • Tianyi Wang, Yang Chen, Xing Li
  • Computer Science
    2011 31st International Conference on Distributed Computing Systems Workshops
  • 2011
TLDR
This paper analyzes the state-of-the-art graph sampling algorithms and evaluates their performance on some widely recognized graph properties on directed graphs using large-scale social network datasets and finds that none of the algorithms is able to obtain satisfactory sampling results in both of these properties.
A Survey and Taxonomy of Graph Sampling
TLDR
This survey discusses both classical text-book type properties and some advanced properties of graph sampling, and provides a taxonomy of different graph sampling objectives and graph sampling approaches.
The LDBC Social Network Benchmark: Interactive Workload
TLDR
This paper describes the LDBC Social Network Benchmark (SNB), and presents database benchmarking innovation in terms of graph query functionality tested, correlated graph generation techniques, as well as a scalable benchmark driver on a workload with complex graph dependencies.
Sampling from large graphs
TLDR
The best performing methods are the ones based on random-walks and "forest fire"; they match very accurately both static as well as evolutionary graph patterns, with sample sizes down to about 15% of the original graph.