START: Straggler Prediction and Mitigation for Cloud Computing Environments using Encoder LSTM Networks

@article{Tuli2021STARTSP,
  title={START: Straggler Prediction and Mitigation for Cloud Computing Environments using Encoder LSTM Networks},
  author={Shreshth Tuli and Sukhpal Singh Gill and Peter Garraghan and Rajkumar Buyya and Giuliano Casale and Nicholas R. Jennings},
  journal={ArXiv},
  year={2021},
  volume={abs/2111.10241}
}
Modern large-scale computing systems distribute jobs into multiple smaller tasks which execute in parallel to accelerate job completion rates and reduce energy consumption. However, a common performance problem in such systems is dealing with straggler tasks: slow-running instances that increase the overall response time. Such tasks can significantly impact the system’s Quality of Service (QoS) and the Service Level Agreements (SLA). To combat this issue, there is a need for automatic…

MCDS: AI Augmented Workflow Scheduling in Mobile Edge Cloud Computing Systems

MCDS is an Artificial Intelligence (AI) based scheduling approach that uses a tree-based search strategy and a deep neural network based surrogate model to estimate the long-term QoS impact of immediate actions for robust optimization of scheduling decisions.

AI Augmented Edge and Fog Computing: Trends and Challenges

This survey reviews the evolution of data-driven AI-augmented technologies and their impact on computing systems, and presents the latest trends and impact areas such as optimizing AI models that are deployed on or for computing systems.

References

Straggler Detection in Parallel Computing Systems through Dynamic Threshold Calculation

This paper presents an algorithm for dynamically calculating a threshold value to identify task stragglers, considering key parameters including job QoS timing constraints, task execution characteristics, and optimal system resource utilization.

Tails in the cloud: a survey and taxonomy of straggler management within large-scale cloud data centres

A comprehensive review of straggler management techniques within large-scale cloud data centres is presented, which provides a detailed taxonomy of straggler causes, as well as proposed management and mitigation techniques based on straggler characteristics and properties.

Wrangler: Predictable and Faster Jobs using Fewer Resources

For production-level workloads from Facebook and Cloudera's customers, Wrangler improves the 99th percentile job completion time by up to 61% as compared to speculative execution, a widely used straggler mitigation technique.

A novel approach to workload prediction using attention-based LSTM encoder-decoder network in cloud environment

An approach using the long short-term memory (LSTM) encoder-decoder network with attention mechanism to improve the workload prediction accuracy and a scroll prediction method, which splits a long prediction sequence into several small sequences to monitor and control prediction accuracy.

RPPS: A Novel Resource Prediction and Provisioning Scheme in Cloud Data Center

RPPS (Cloud Resource Prediction and Provisioning scheme), a scheme that automatically predicts future demand and performs proactive resource provisioning for cloud applications, employs the ARIMA model to predict future workloads, combines coarse-grained and fine-grained resource scaling under different situations, and adopts a VM-complementary migration strategy.

COSCO: Container Orchestration Using Co-Simulation and Gradient Based Optimization for Fog Computing Environments

A Gradient Based Optimization Strategy using Back-propagation of gradients with respect to Input (GOBI) and a hybrid simulation driven decision approach, GOBI*, to optimize Quality of Service (QoS) parameters to adapt quickly in volatile environments are proposed.

Optimization for Speculative Execution in Big Data Processing Clusters

  • Huanle Xu, W. Lau
  • Computer Science
    IEEE Transactions on Parallel and Distributed Systems
  • 2017
This paper analyzes and proposes a cloning scheme, the Smart Cloning Algorithm (SCA), derives the workload threshold under which SCA should be used for speculative execution, and proposes the Enhanced Speculative Execution (ESE) algorithm, an extension of the Microsoft Mantri scheme, to mitigate stragglers.

PRISM: An Experiment Framework for Straggler Analytics in Containerized Clusters

This paper proposes PRISM, a framework that automates containerized cluster setup, experiment design, and experiment execution, and uses it to conduct automated experimentation of system operational conditions and identify that straggler manifestation is affected by resource contention, input data size, and scheduler architecture limitations.

A GRU-Based Prediction Framework for Intelligent Resource Management at Cloud Data Centres in the Age of 5G

An intelligent prediction framework named IGRU-SD (Improved Gated Recurrent Unit with Stragglers Detection), based on state-of-the-art data analytics and Artificial Intelligence techniques and aimed at predicting the anticipated level of resource requests over a period of time into the future, is proposed.

On data skewness, stragglers, and MapReduce progress indicators

A novel profile-guided progress indicator that operates without the linear hypothesis assumption in a fully online way (i.e., without resorting to profile data collected from previous executions), called NearestFit, which exploits a careful combination of nearest neighbor regression and statistical curve fitting techniques.