Benchmarking Distributed Stream Processing Engines

@inproceedings{Karimov2018BenchmarkingDS,
  title={Benchmarking Distributed Stream Processing Engines},
  author={Jeyhun Karimov and Tilmann Rabl and Asterios Katsifodimos and Roman S. Samarev and Henri Heiskanen and Volker Markl},
  booktitle={ICDE},
  year={2018}
}
Over the last years, stream data processing has been gaining attention both in industry and in academia due to its wide range of applications. […] First, we give a definition of latency and throughput for stateful operators. Second, we completely separate the system under test and driver, so that the measurement results are closer to actual system performance under real conditions. Third, we build the first driver to test the actual sustainable performance of a system under test. Our detailed…

Evaluation of Stream Processing Frameworks

TLDR
The relationship between latency, throughput, and resource consumption is analyzed, along with the performance impact of adding different common operations to the pipeline; the results show that the latency disadvantages of using a micro-batch system are most apparent for stateless operations.

Darwin: Scale-In Stream Processing

TLDR
Darwin is presented: the authors' scale-in SPE prototype, which tailors its execution towards arbitrary target environments by compiling stream processing queries while supporting recoverable larger-than-memory state management.

Cost-Aware Streaming Data Analysis: Distributed vs Single-Thread

TLDR
This work presents an empirical study that compares the cost of two performance-equivalent solutions for a real streaming data analysis task in the telecommunication industry and shows that the most cost-effective solution depends on the dataset size.

Benchmarking Synchronous and Asynchronous Stream Processing Systems

TLDR
To understand the upper bound of the maximum sustainable throughput that is possible for a given node configuration, multiple hard-coded multi-threaded processes (called ad-hoc dataflows) in C++ are designed using Message Passing Interface (MPI) and Pthread libraries, for two use-cases, such that they could collectively process an input stream based on the logic of the use-case.

Scalable Analytics on Fast Data

TLDR
This article explores extensions to database systems to match the performance and usability of streaming systems and focuses on main-memory database systems, such as HyPer, which are well-suited for analytical streaming workloads.

Big SQL systems: an experimental evaluation

TLDR
An extensive experimental study of four popular systems in this domain, namely Apache Hive, Spark SQL, Apache Impala, and PrestoDB, is presented, and the performance characteristics of these systems are analyzed using three different benchmarks.

Performance Characterization and Modeling of Serverless and HPC Streaming Applications

  • André Luckow, S. Jha
  • Computer Science
    2019 IEEE International Conference on Big Data (Big Data)
  • 2019
TLDR
Pilot-Streaming is extended to support serverless platforms, and it is demonstrated that StreamInsight provides an accurate model for a variety of application characteristics, e.

Scotty: General and Efficient Open-source Window Aggregation for Stream Processing Systems

TLDR
Scotty, an efficient and general open-source operator for sliding-window aggregation in stream processing systems such as Apache Flink, Apache Beam, Apache Samza, Apache Kafka, Apache Spark, and Apache Storm, is presented; one can easily extend Scotty with user-defined aggregation functions and window types.

References

Stream Bench: Towards Benchmarking Modern Distributed Stream Computing Frameworks

TLDR
This paper proposes the benchmark definition Stream Bench with regard to the requirements, introduces a message system that acts as a mediator between stream data generation and consumption, and applies the benchmark to two popular frameworks, Apache Storm and Apache Spark Streaming.

A Performance Comparison of Open-Source Stream Processing Platforms

TLDR
Results show that the performance of the native stream processing systems, Storm and Flink, is up to 15 times higher than that of the micro-batch processing system, Spark Streaming, while Spark Streaming is more robust to node failures and provides recovery without losses.

Benchmarking Streaming Computation Engines: Storm, Flink and Spark Streaming

TLDR
A streaming benchmark for three representative computation engines: Flink, Storm and Spark Streaming is developed and a performance comparison of the three data engines in terms of 99th percentile latency and throughput for various configurations is provided.

SparkBench: a comprehensive benchmarking suite for in memory data analytic platform Spark

TLDR
This paper presents SparkBench, a Spark specific benchmarking suite, which includes a comprehensive set of applications, including machine learning, graph computation, SQL query and streaming applications, and evaluates the performance impact of a key configuration parameter to guide the design and optimization of Spark data analytic platform.

BigDataBench: A big data benchmark suite from internet services

  • Lei Wang, Jianfeng Zhan, Bizhu Qiu
  • Computer Science
    2014 IEEE 20th International Symposium on High Performance Computer Architecture (HPCA)
  • 2014
TLDR
The big data benchmark suite BigDataBench not only covers broad application scenarios but also includes diverse and representative data sets, and the 19 big data workloads included in BigDataBench are comprehensively characterized with varying data inputs.

Reproducible Experiments for Comparing Apache Flink and Apache Spark on Public Clouds

TLDR
This project has two main goals: making a few community-accepted benchmarks easily reproducible on the cloud and validating the performance claimed by those studies.

Spark Versus Flink: Understanding Performance in Big Data Analytics Frameworks

TLDR
A fine characterization of the cases when each framework is superior is performed, and how this performance correlates to operators, to resource usage and to the specifics of the internal framework design is highlighted.

TimeStream: reliable stream computation in the cloud

TLDR
This work advocates a powerful new abstraction called resilient substitution that caters to the specific needs in this new computation model to handle failure recovery and dynamic reconfiguration in response to load changes.

S4: Distributed Stream Computing Platform

TLDR
The architecture resembles the Actors model, providing semantics of encapsulation and location transparency, thus allowing applications to be massively concurrent while exposing a simple programming interface to application developers.

The Dataflow Model: A Practical Approach to Balancing Correctness, Latency, and Cost in Massive-Scale, Unbounded, Out-of-Order Data Processing

TLDR
One such approach is presented, the Dataflow Model, along with a detailed examination of the semantics it enables, an overview of the core principles that guided its design, and a validation of the model itself via the real-world experiences that led to its development.