Big Data Benchmark Compendium

@inproceedings{Ivanov2015BigDB,
  title={Big Data Benchmark Compendium},
  author={Todor Ivanov and Tilmann Rabl and Meikel P{\"o}ss and Anna Queralt and John Poelman and Nicol{\'a}s Poggi and Jeffrey Buell},
  booktitle={TPCTC},
  year={2015}
}
The field of Big Data and related technologies is rapidly evolving. Consequently, many benchmarks are emerging, driven by academia and industry alike. Because these benchmarks emphasize different aspects of Big Data and, in many cases, cover different technical platforms and use cases, it is extremely difficult to keep up with the pace of benchmark creation. Also, with the combination of large volumes of data, heterogeneous data formats, and changing processing velocity, it becomes…
Big Data Data Management Systems performance analysis using Aloja and BigBench
TLDR
The ALOJA benchmarking platform is expanded, and the expanded BigBench architecture made it possible to detect a difference in task management between engines by comparing the de facto SQL Big Data engine, Hive, against the growing Spark-SQL.
BigBench V2: The New and Improved BigBench
TLDR
The proof of concept shows the feasibility of BigBench V2 and outlines different ways of implementing late binding, and a new scale factor-based data generator is implemented to produce structured tables, key-value semistructured web-logs, and unstructured data.
Application-Level Benchmarking of Big Data Systems
TLDR
This chapter gives an introduction to big data benchmarking and presents different proposals and standardization efforts to provide objective evaluations of alternative technologies and solution approaches to a given big data problem.
Classifying, evaluating and advancing big data benchmarks
TLDR
The thesis is an attempt to redefine system benchmarking, taking into account the new requirements posed by Big Data applications. With the explosion of Artificial Intelligence (AI) and new hardware computing power, this is a first step towards a more holistic approach to benchmarking.
TPCx-BB (Big Bench) in a Single-Node Environment
TLDR
This paper presents response times of all 30 BigBench queries when run sequentially to showcase the advanced analytics and machine learning capabilities integrated within SQL Server 2019, and presents results from data scalability experiments over two scale factors to understand the impact of increasing data size on query runtimes.
Characterizing BigBench Queries, Hive, and Spark in Multi-cloud Environments
TLDR
The query characterization highlights the similarities and differences between the Hive and Spark frameworks and identifies which queries consume the most CPU, memory, and I/O. The results show how Hive and Spark compare and what performance can be expected of each in PaaS.
Elkhan Shahverdi Comparative Evaluation for the Performance of Big Stream Processing Systems
TLDR
This thesis aims to conduct an empirical evaluation and benchmarking of state-of-the-art big stream processing systems to compare and contrast the Apache Flink, Apache Storm, Heron, Kafka, and Apache Spark stream processing engines.
A Big Data: Tools, Systems and Benchmarks
TLDR
Big data benchmarks play an important role in evaluating the performance of big data systems, and some open issues related to benchmarking in terms of data generation techniques and workloads are discussed.
Characterizing TPCx-BB queries, Hive, and Spark in multi-cloud environments
TLDR
The study characterizes TPCx-BB queries and the out-of-the-box performance of Spark and Hive versions in the cloud, comparing the reliability, scalability, and performance of popular PaaS offerings, including Azure HDInsight, Amazon Web Services EMR, and Google Dataproc.
...
...

References

Discussion of BigBench: A Proposed Industry Standard Performance Benchmark for Big Data
TLDR
The BigBench benchmark is presented and the suitability and relevance of the workload is evaluated from the point of view of enterprise applications, and potential extensions to the proposed specification are discussed in order to cover typical big data processing use cases.
A characterization of big data benchmarks
TLDR
The redundancy among benchmarks from ICTBench, HiBench, and typical workloads from real-world applications is analyzed to present an initial idea of a big data benchmark suite for spatio-temporal data.
BigDataBench: A big data benchmark suite from internet services
  • Lei Wang, Jianfeng Zhan, Bizhu Qiu
  • Computer Science
    2014 IEEE 20th International Symposium on High Performance Computer Architecture (HPCA)
  • 2014
TLDR
The big data benchmark suite BigDataBench not only covers broad application scenarios but also includes diverse and representative data sets, and the paper comprehensively characterizes the 19 big data workloads included in BigDataBench with varying data inputs.
Introducing TPCx-HS: The First Industry Standard for Benchmarking Big Data Systems
TLDR
TPCx-HS is the first industry standard benchmark designed to stress both the hardware and software of systems based on Apache HDFS API compatible distributions, and it can be used to assess a broad range of system topologies and implementation methodologies of Big Data Hadoop systems in a technically rigorous, directly comparable, and vendor-neutral manner.
Issues in big data testing and benchmarking
TLDR
Initial solutions and challenges with respect to big data generation, methods for creating realistic, privacy-aware, and arbitrarily scalable data sets, workloads, and benchmarks from real world data are described.
From TPC-C to Big Data Benchmarks: A Functional Workload Model
TLDR
This position paper argues for building future big data benchmarks using what is called a "functional workload model", which draws on combined experiences from standard benchmarks, exemplified by TPC-C.
Parallel data generation for performance analysis of large, complex RDBMS
TLDR
This paper analyzes the requirements of today's data generators, either explaining how the problems have been solved in existing data generators or showing why they have not been solved yet.
Memory system characterization of big data workloads
TLDR
This paper develops an analysis methodology to understand how conventional optimizations such as caching, prediction, and prefetching may apply to Hadoop and NoSQL big data workloads, and discusses the implications for software and system design.
A comparison of approaches to large-scale data analysis
TLDR
A benchmark consisting of a collection of tasks that are run on an open source version of MapReduce (MR) as well as on two parallel DBMSs shows a dramatic performance difference between the two paradigms.
BDGS: A Scalable Big Data Generator Suite in Big Data Benchmarking
TLDR
This work develops a tool, called Big Data Generator Suite (BDGS), to efficiently generate scalable big data while employing data models derived from real data to preserve data veracity.
...
...