Tarema: Adaptive Resource Allocation for Scalable Scientific Workflows in Heterogeneous Clusters

@inproceedings{Bader2021TaremaAR,
  title={Tarema: Adaptive Resource Allocation for Scalable Scientific Workflows in Heterogeneous Clusters},
  author={Jonathan Bader and Lauritz Thamsen and Svetlana Kulagina and Jonathan Will and Henning Meyerhenke and Odej Kao},
  booktitle={2021 IEEE International Conference on Big Data (Big Data)},
  year={2021},
  pages={65-75}
}
Scientific workflow management systems like Nextflow support large-scale data analysis by abstracting away the details of scientific workflows. In these systems, workflows consist of several abstract tasks, of which instances are run in parallel and transform input partitions into output partitions. Resource managers like Kubernetes execute such workflow tasks on cluster infrastructures. However, these resource managers only consider the number of CPUs and the amount of available memory when… 
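As the truncated abstract notes, resource managers such as Kubernetes describe a workflow task to the scheduler only through a CPU count and a memory amount. The following sketch is not part of the paper; it uses the official Kubernetes Python client with illustrative image, pod, and request values to show that nothing in such a task submission expresses hardware differences between nodes:

# Sketch: submitting one workflow task to Kubernetes as a pod.
# The resource request below carries only a CPU count and a memory
# amount; node-level differences such as CPU speed, memory bandwidth,
# or I/O performance cannot be expressed here.
from kubernetes import client, config

config.load_kube_config()  # credentials from the local kubeconfig

task_container = client.V1Container(
    name="workflow-task",
    image="example.org/workflow-task:latest",    # illustrative image
    resources=client.V1ResourceRequirements(
        requests={"cpu": "2", "memory": "4Gi"},  # all the scheduler considers
        limits={"cpu": "2", "memory": "4Gi"},
    ),
)

pod = client.V1Pod(
    api_version="v1",
    kind="Pod",
    metadata=client.V1ObjectMeta(name="workflow-task-1"),
    spec=client.V1PodSpec(containers=[task_container], restart_policy="Never"),
)

client.CoreV1Api().create_namespaced_pod(namespace="default", body=pod)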

Citations

Lotaru: Locally Estimating Runtimes of Scientific Workflow Tasks in Heterogeneous Clusters

TLDR
Lotaru is presented, a novel online method for locally estimating task runtimes in scientific workflows on heterogeneous clusters that significantly outperforms the baselines in terms of prediction errors for homogeneous and heterogeneous clusters.

Reshi: Recommending Resources for Scientific Workflow Tasks on Heterogeneous Infrastructures

TLDR
Reshi is proposed, a method for recommending task-node assignments during workflow execution that can cope with heterogeneous tasks and heterogeneous nodes; it outperforms HEFT and is compared against three state-of-the-art schedulers.

Collaborative Cluster Configuration for Distributed Data-Parallel Processing: A Research Overview

TLDR
This paper describes how the similarity of processing jobs and cluster infrastructures can be employed to combine suitable data points from local and global job executions into accurate performance models, and outlines approaches to performance prediction via more context-aware and reusable models.

Get Your Memory Right: The Crispy Resource Allocation Assistant for Large-Scale Data Processing

TLDR
Crispy attempts to extrapolate the memory usage for the full dataset to then choose a cluster configuration with enough total memory, and sees a reduction of job execution costs by 56% compared to the baseline, while on average spending less than ten minutes on profiling runs per job on a consumer-grade laptop.

On the Potential of Execution Traces for Batch Processing Workload Optimization in Public Clouds

TLDR
This paper proposes a collaborative approach for sharing anonymized workload execution traces among users, mining them for general patterns, and exploiting clusters of historical workloads for future optimizations, and evaluates the prototype implementation for mining workload execution graphs on a publicly available trace dataset.

Training Data Reduction for Performance Models of Data Analytics Jobs in the Cloud

TLDR
This paper examines several clustering techniques to minimize training data size while keeping the associated performance models accurate, and indicates that efficiency gains in data transfer, storage, and model training can be achieved through training data reduction.

Kubernetes Scheduling: Taxonomy, ongoing issues and challenges

TLDR
A study of empirical research on Kubernetes scheduling techniques is conducted and a new taxonomy for Kubernetes scheduling is presented to establish insight, identify the main gaps, and thus guide future research in the area.

References

SHOWING 1-10 OF 44 REFERENCES

Feedback-Based Resource Allocation for Batch Scheduling of Scientific Workflows

TLDR
This paper investigates the possibility of improving upon inaccurate user estimates by incorporating an online feedback loop between workflow scheduling, resource usage prediction, and measurement, and demonstrates its effectiveness by predicting the peak memory usage of tasks.

Selecting resources for distributed dataflow systems according to runtime targets

TLDR
This paper presents Bell, a practical system that monitors job execution, models the scale-out behavior of jobs based on previous runs, and selects resources according to user-provided runtime targets; it concludes that the model selection approach provides better overall performance than the individual prediction models.

Parallelization in Scientific Workflow Management Systems

TLDR
The survey gives an overview of parallelization techniques for SWfMS, both in theory and in their realization in concrete systems, finds that current systems leave considerable room for improvement, and proposes key advancements to the landscape of SWfMS.

SAASFEE: Scalable Scientific Workflow Execution Engine

TLDR
SAASFEE is presented, a SWfMS that runs arbitrarily complex workflows on Hadoop YARN and offers the ability to execute iterative workflows, an adaptive task scheduler, re-executable provenance traces, and compatibility with selected other workflow systems.

A Heterogeneity-Aware Task Scheduler for Spark

TLDR
RUPAM is presented, a heterogeneity-aware task scheduling system for big data platforms that considers both task-level resource characteristics and underlying hardware characteristics while preserving data locality.

Towards Collaborative Optimization of Cluster Configurations for Distributed Dataflow Jobs

TLDR
This work proposes a collaborative approach for finding optimal cluster configurations based on sharing and learning from historical runtime data of distributed dataflow jobs, and suggests that a good cluster configuration avoids hardware bottlenecks and maximizes resource utilization, avoiding costly overprovisioning.

A taxonomy and survey on scheduling algorithms for scientific workflows in IaaS cloud computing environments

TLDR
This work identifies challenges and studies existing algorithms from the perspective of the scheduling models they adopt as well as the resource and application models they consider, and presents a detailed taxonomy that focuses on features particular to clouds.

C3O: Collaborative Cluster Configuration Optimization for Distributed Data Processing in Public Clouds

TLDR
This work presents C3O, a collaborative system for optimizing data processing cluster configurations in public clouds based on shared historical runtime data, which is used with specialized regression models to predict the runtimes of data processing jobs on different possible cluster configurations.

Scheduling Scientific Workflows Elastically for Cloud Computing

  • Cui Lin, Shiyong Lu
  • Computer Science
  • 2011 IEEE 4th International Conference on Cloud Computing
  • 2011
TLDR
The preliminary experiments show that SHEFT not only outperforms several representative workflow scheduling algorithms in optimizing workflow execution time, but also enables resources to scale elastically at runtime.

Mary, Hugo, and Hugo*: Learning to schedule distributed data‐parallel processing jobs on shared clusters

TLDR
The results of these experiments show that the approach, which uses reinforcement learning and a measure of co-location goodness to let cluster schedulers learn over time which jobs are best executed together on shared resources, can significantly increase resource utilization and job throughput.