Straggler Mitigation at Scale

@article{Akta2019StragglerMA,
  title={Straggler Mitigation at Scale},
  author={Mehmet Fatih Aktaş and Emina Soljanin},
  journal={IEEE/ACM Transactions on Networking},
  year={2019},
  volume={27},
  pages={2266-2279}
}
  • M. Aktaş, E. Soljanin
  • Published 25 June 2019
  • Computer Science, Mathematics
  • IEEE/ACM Transactions on Networking
Runtime performance variability has been a major issue, hindering predictable and scalable performance in modern distributed systems. Executing requests or jobs redundantly over multiple servers has been shown to be effective for mitigating variability, both in theory and in practice. Systems that employ redundancy have drawn significant attention, and numerous papers have analyzed the pain and gain of redundancy under various service models and assumptions on the runtime variability. This paper…
Diversity vs. Parallelism in Distributed Computing with Redundancy
TLDR
This work characterizes the diversity vs. parallelism tradeoff for three common models of task-size-dependent execution times and finds that different models operate optimally at different levels of redundancy, and thus may require very different code rates.
Data Replication for Reducing Computing Time in Distributed Systems with Stragglers
TLDR
This work studies the optimal replication of data in systems where the job execution time is a stochastically decreasing and convex random variable, and derives the optimum redundancy levels for minimizing both the expected value and the variance of the job completion time for Exponential and Shifted-Exponential service times.
Modeling and Optimization of Latency in Erasure-coded Storage Systems
TLDR
This monograph reviews recent progress on systems that employ erasure codes for distributed storage, discusses exemplary implementations of erasure-coded storage, illuminates key design degrees of freedom and tradeoffs, and summarizes remaining challenges in real-world storage systems such as content delivery and caching.
Analyzing the Download Time of Availability Codes
Availability codes have recently been proposed to facilitate efficient storage, management, and retrieval of frequently accessed data in distributed storage systems. Such codes provide multiple
Redundancy Scheduling in Systems with Bi-Modal Job Service Time Distributions
  • Amir Behrouzi-Far, E. Soljanin
  • Computer Science, Mathematics
    2019 57th Annual Allerton Conference on Communication, Control, and Computing (Allerton)
  • 2019
TLDR
This work develops an analogy to a classical urns and balls problem, and uses it to study the queuing time performance of two non-adaptive classical scheduling policies: random and round-robin.
START: Straggler Prediction and Mitigation for Cloud Computing Environments using Encoder LSTM Networks
TLDR
START, a Straggler Prediction and Mitigation Technique, predicts which tasks might be stragglers and dynamically adapts scheduling to achieve lower response times, reducing execution time, resource contention, energy, and SLA violations.
Towards Performance Modeling of Speculative Execution for Cloud Applications
TLDR
This paper presents a performance model of cloud applications that utilize speculative execution, and studies the popular Join-Shortest-Queue load-balancing strategy under the processor-sharing queuing discipline.
Causal and Interpretable Learning for Datacenter Latency Prediction
Stragglers—computations that exhibit extreme tail latencies—present a major challenge to delivering predictable performance in datacenters. Accurately predicting stragglers would enable efficient,
Timely Distributed Computation With Stragglers
TLDR
This work investigates the age performance of uncoded and coded schemes in the presence of stragglers under i.i.d. exponential transmission delays and shows that, asymptotically, the MM-MDS coded scheme outperforms the other schemes.
Timely Distributed Computation with Stragglers
We consider a status update system in which the update packets need to be processed to extract the embedded useful information. The source node sends the acquired information to a computation unit

References

Straggler Mitigation by Delayed Relaunch of Tasks
TLDR
It is found that coded redundancy achieves better cost vs. latency tradeoff than simple replication and can yield reduction in both cost and latency under less heavy tailed execution times.
Effective Straggler Mitigation
TLDR
This work finds that coded redundancy achieves a better cost vs. latency tradeoff and allows for a larger achievable latency and cost tradeoff region than replication, and can yield a reduction in both cost and latency under less heavy-tailed execution times.
A Better Model for Job Redundancy: Decoupling Server Slowdown and Job Size
TLDR
A dispatching policy, Redundant-to-Idle-Queue, is designed, which is both analytically tractable within the $S\&X$ model and has provably excellent performance.
Effective Straggler Mitigation: Attack of the Clones
TLDR
Evaluation of the proposed system, Dolly, using production workloads shows that small jobs speed up by 34% to 46% after state-of-the-art mitigation techniques have been applied, using just 5% extra resources for cloning.
Hopper: Decentralized Speculation-aware Cluster Scheduling at Scale
TLDR
This work presents Hopper, a job scheduler that is speculation-aware, i.e., that integrates the tradeoffs associated with speculation into job scheduling decisions and shows that 50% improvements over state-of-the-art centralized (decentralized) schedulers and speculation strategies can be achieved through the coordination of scheduling and speculation.
Proactive Straggler Avoidance using Machine Learning
The MapReduce architecture provides self-managed parallelization with fault tolerance for large-scale data processing. Stragglers, the tasks running slower than other tasks of a job, could
Low latency via redundancy
TLDR
It is argued that the use of redundancy is an effective way to convert extra capacity into reduced latency: by initiating redundant operations across diverse resources and using the first result that completes, redundancy improves a system's latency even under exceptional conditions.
Using Straggler Replication to Reduce Latency in Large-scale Parallel Computing
TLDR
This work analyzes how task replication reduces latency, and proposes a heuristic algorithm to search for the best replication strategies when it is difficult to model the empirical behavior of task execution time and use the proposed analysis techniques.
Reducing late-timing failure at scale: straggler root-cause analysis in cloud datacenters
TLDR
The preliminary study of a production Cloud datacenter indicates that the dominant straggler root cause is high temporal resource contention, a finding that can assist in enhancing straggler prediction and mitigation for tolerating late-timing failures within large-scale distributed systems.
Analysis of SRPT scheduling: investigating unfairness
TLDR
The degree of unfairness under SRPT is surprisingly small, and closed-form expressions for mean response time as a function of job size are proved in this setting.