On the Potential of Execution Traces for Batch Processing Workload Optimization in Public Clouds

Dominik Scheinert, Alireza Alamgiralem, Jonathan Bader, Jonathan Will, Thorsten Wittkopp, Lauritz Thamsen
2021 IEEE International Conference on Big Data (Big Data)

With the growing amount of data, data processing workloads and the management of their resource usage become increasingly important. Since managing a dedicated infrastructure is in many situations infeasible or uneconomical, users increasingly execute their workloads in the cloud. As the configuration of workloads and resources is often challenging, various methods have been proposed that either quickly profile towards a good configuration or determine one based on data from… 

Training Data Reduction for Performance Models of Data Analytics Jobs in the Cloud
This paper examines several clustering techniques to minimize training data size while keeping the associated performance models accurate, and indicates that efficiency gains in data transfer, storage, and model training can be achieved through training data reduction.


Optimizing Machine Learning Workloads in Collaborative Environments
This paper presents a system to optimize the execution of ML workloads in collaborative environments by reusing previously performed operations and their results, and devises a linear-time reuse algorithm to find the optimal execution plan for incoming ML workloads.
Heterogeneity and dynamicity of clouds at scale: Google trace analysis
Analysis of the first publicly available trace data from a sizable multi-purpose cluster finds that many longer-running jobs have relatively stable resource utilizations, which can help adaptive resource schedulers.
C3O: Collaborative Cluster Configuration Optimization for Distributed Data Processing in Public Clouds
This work presents C3O, a collaborative system for optimizing data processing cluster configurations in public clouds based on shared historical runtime data, utilized for predicting the runtimes of data processing jobs on different possible cluster configurations, using specialized regression models.
Ernest: Efficient Performance Prediction for Large-Scale Advanced Analytics
This paper presents Ernest, a performance prediction framework for large-scale analytics; an evaluation on Amazon EC2 using several workloads shows that the prediction error is low while the training overhead is less than 5% for long-running jobs.
Understanding the Workload Characteristics in Alibaba: A View from Directed Acyclic Graph Analysis
An in-depth analysis of the trace dataset released by Alibaba in December 2018, which consists of 4195049 batch jobs and 71476 containers co-located on about 4000 machines, explains the relationship between jobs and tasks and reveals several new insights.
Characterizing Co-located Datacenter Workloads: An Alibaba Case Study
This paper analyzes Alibaba's co-located workload trace, the first publicly available dataset with precise information about the category of each job, and reveals insights that are useful for system designers and IT practitioners working on cluster management systems.
Imbalance in the cloud: An analysis on Alibaba cluster trace
This paper performs a deep analysis of a trace dataset released by Alibaba in September 2017, consisting of detailed statistics of 11089 online service jobs and 12951 batch jobs co-located on 1300 machines over 12 hours, revealing several important insights about different types of imbalance in the Alibaba cloud.
Bellamy: Reusing Performance Models for Distributed Dataflow Jobs Across Contexts
This paper presents Bellamy, a novel modeling approach that combines scale-outs, dataset sizes, and runtimes with additional descriptive properties of a dataflow job, and shows that Bellamy outperforms state-of-the-art methods.
Quick Execution Time Predictions for Spark Applications
This paper proposes an alternative approach called PERIDOT to accurately predict the performance of a variety of Spark applications spanning text analytics, linear algebra, machine learning, and Spark SQL, and shows that a state-of-the-art machine-learning-based execution time prediction algorithm performs poorly when presented with such limited training data.
Finding the right cloud configuration for analytics clusters
This paper proposes Vanir, an optimization framework designed to operate in an ecosystem of multiple distributed systems forming an analytics cluster, which can find deployments that perform comparably to the ones found by state-of-the-art single-system cloud configuration optimizers while spending 2X fewer benchmarking runs.