Fuxi: a Fault-Tolerant Resource Management and Job Scheduling System at Internet Scale

  Zhuo Zhang, C. Li, Yangyu Tao, Renyu Yang, Hong Tang, Jie Xu. Proceedings of the VLDB Endowment, 2014.
Scalability and fault-tolerance are two fundamental challenges for all distributed computing at Internet scale. Despite many recent advances from both academia and industry, these two problems are still far from settled. In this paper, we present Fuxi, a resource management and job scheduling system that is capable of handling the kind of workload at Alibaba where hundreds of terabytes of data are generated and analyzed every day to help optimize the company's business operations and user…
Swift: Reliable and Low-Latency Data Processing at Cloud Scale
Reports experience with Swift, a system that efficiently runs real-time and interactive data-processing jobs at cloud scale, supporting as many as 140,000 executors and processing millions of jobs per day.
Performance-Aware Speculative Resource Oversubscription for Large-Scale Clusters
ROSE allows latency-sensitive long-running applications (LRAs) to co-exist with computation-intensive batch jobs; node agents avoid excessive resource oversubscription through an admission-control mechanism that combines multi-resource threshold control with performance-aware resource throttling.
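The multi-resource threshold check described above can be sketched in a few lines. This is an illustrative sketch, not ROSE's actual implementation; the resource names, capacities, and threshold values are assumptions.

```python
# Illustrative sketch of threshold-based oversubscription admission control
# (not ROSE's actual code): a node agent admits a speculative task only if
# every resource stays below its per-resource oversubscription threshold.

def admit(task_demand, node_usage, capacity, thresholds):
    """Return True if adding task_demand keeps each resource under
    thresholds[res] * capacity[res] on this node."""
    for res, demand in task_demand.items():
        projected = node_usage.get(res, 0.0) + demand
        if projected > thresholds[res] * capacity[res]:
            return False  # would oversubscribe this resource; reject task
    return True

capacity = {"cpu": 32.0, "mem": 128.0}   # cores, GiB (assumed values)
usage = {"cpu": 24.0, "mem": 90.0}       # current usage on this node
limits = {"cpu": 0.9, "mem": 0.8}        # per-resource thresholds

small_ok = admit({"cpu": 2.0, "mem": 8.0}, usage, capacity, limits)
big_rejected = admit({"cpu": 2.0, "mem": 20.0}, usage, capacity, limits)
```

Here the second task is rejected because it would push memory past 80% of capacity, even though its CPU demand alone would still fit.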
TRIPOD: An Efficient, Highly-available Cluster Management System
Proposes the design of Tripod, a cluster management system that automatically provides high availability to general applications, using a new Paxos replication protocol that leverages RDMA (Remote Direct Memory Access).
Shaready: A Resource-Isolated Workload Co-Location System
This work proposes Shaready, an isolation-based cluster resource-sharing system that enables workload co-residence, and demonstrates that system CPU and memory utilization can be improved by roughly 50% and 16.67% respectively on average, with at most 7% performance degradation.
Tempo: Robust and Self-Tuning Resource Management in Multi-tenant Parallel Databases
Proposes Tempo, a framework that brings simplicity, self-tuning, and robustness to existing resource managers (RMs) by letting performance objectives be specified declaratively and optimizing the RM configuration settings to meet them.
Adaptive Speculation for Efficient Internetware Application Execution in Clouds
Presents an algorithm that improves the execution efficiency of Internetware applications by dynamically calculating the straggler threshold, taking into account key parameters including job QoS timing constraints, task execution progress, and optimal system resource utilization.
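A dynamically calculated straggler threshold can be illustrated as follows. The formula below is an assumption for illustration only, not the paper's algorithm: it derives the cutoff from the mean task progress rate and a QoS slack factor, so jobs closer to their deadline speculate more aggressively.

```python
# Illustrative sketch of adaptive speculation: instead of a fixed cutoff
# (e.g. "1.5x slower than average"), derive the straggler threshold from
# the current progress distribution and the job's timing slack.

from statistics import mean

def straggler_threshold(progress_rates, qos_slack):
    """Tasks whose progress rate falls below the returned threshold are
    candidates for speculation. qos_slack in (0, 1]: smaller slack
    (tighter deadline) -> higher threshold -> more speculation."""
    avg = mean(progress_rates)
    return avg * (1.0 - 0.5 * qos_slack)

rates = [1.0, 0.95, 1.05, 0.4]   # progress per second, one entry per task
tight = straggler_threshold(rates, qos_slack=0.2)   # near-deadline job
loose = straggler_threshold(rates, qos_slack=0.9)   # plenty of slack
stragglers = [r for r in rates if r < tight]
```

With these numbers only the 0.4-rate task falls below the tight threshold, and the loose threshold is lower still, so a relaxed job speculates on fewer tasks.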
Efficient Fault Tolerance Through Dynamic Node Replacement
Proposes a dynamic node-replacement algorithm that finds replacement nodes by exploiting the flexibility of moldable and malleable jobs; it maintains high throughput even when a system experiences frequent node failures, making it a natural complement to multi-level checkpointing.
Computing at Massive Scale: Scalability and Dependability Challenges
  Renyu Yang, Jie Xu. 2016 IEEE Symposium on Service-Oriented System Engineering (SOSE), 2016.
Presents a data-driven analysis methodology for characterizing resource and workload patterns and tracing performance bottlenecks in a massive-scale distributed computing environment, and examines several fundamental challenges, including incremental but decentralized resource scheduling, incremental messaging communication, rapid system failover, and request-handling parallelism.
Who Limits the Resource Efficiency of My Datacenter: An Analysis of Alibaba Datacenter Traces
A straightforward way to improve resource efficiency is to co-locate different workloads on the same hardware; to quantify that efficiency and understand the key characteristics of workloads in a co-located cluster, an 8-day trace from Alibaba's production clusters is analyzed.
Dynamic Resource Management and Job Scheduling for High Performance Computing
This thesis presents dynamic scheduling methods for evolving jobs, a unique scheduling technique for malleable jobs, and an algorithm for the combined scheduling of all four job types in a cluster environment, which improves the resiliency of cluster systems.


Failure data analysis of a large-scale heterogeneous server environment
This paper analyzes the empirical and statistical properties of system errors and failures from a network of nearly 400 heterogeneous servers running a diverse workload over a year, and shows that the error and failure patterns consist of time-varying behavior containing long stationary intervals.
Mesos: A Platform for Fine-Grained Resource Sharing in the Data Center
The results show that Mesos can achieve near-optimal data locality when sharing the cluster among diverse frameworks, can scale to 50,000 (emulated) nodes, and is resilient to failures.
Apache Hadoop YARN: yet another resource negotiator
Summarizes the design, development, and current deployment state of YARN, the next generation of Hadoop's compute platform, which decouples the programming model from the resource management infrastructure and delegates many scheduling functions to per-application components.
Omega: flexible, scalable schedulers for large compute clusters
This work addresses the limitations of monolithic cluster-scheduler architectures with a novel approach that uses parallelism, shared state, and lock-free optimistic concurrency control to handle increasing scale and the need for rapid response to changing requirements.
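The shared-state, optimistic-concurrency idea can be sketched in a few lines. This is an illustration of the general technique, not Google's implementation; the class and field names are invented. Each scheduler computes a placement against its view of the cluster state and commits only if no other scheduler has committed in the meantime.

```python
# Sketch of Omega-style optimistic concurrency: schedulers commit
# placements against shared cell state guarded by a version number;
# a stale version means a conflict, and the caller re-reads and retries.

class CellState:
    def __init__(self):
        self.version = 0
        self.assignments = {}        # machine -> job

    def try_commit(self, read_version, machine, job):
        """Commit succeeds only if no other scheduler committed since
        read_version was observed (atomic in this single-threaded sketch)."""
        if read_version != self.version:
            return False             # conflict: re-read state and retry
        self.assignments[machine] = job
        self.version += 1
        return True

cell = CellState()
v = cell.version                     # both schedulers read the same version
first = cell.try_commit(v, "m1", "batch-job")     # first commit wins
second = cell.try_commit(v, "m2", "service-job")  # stale version, must retry
```

The losing scheduler retries with fresh state rather than blocking on a lock, which is what lets many schedulers work on the same cell in parallel.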
Dryad: distributed data-parallel programs from sequential building blocks
The Dryad execution engine handles all the difficult problems of creating a large distributed, concurrent application: scheduling the use of computers and their CPUs, recovering from communication or computer failures, and transporting data between vertices.
Sparrow: distributed, low latency scheduling
It is demonstrated that a decentralized, randomized sampling approach provides near-optimal performance while avoiding the throughput and availability limitations of a centralized design.
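The randomized-sampling placement that this result rests on (batch sampling, a variant of the power-of-d-choices technique) can be sketched as follows; the function and parameter names are illustrative, not Sparrow's API, and queue lengths are simulated rather than probed over the network.

```python
# Sketch of batch sampling: for an m-task job, probe d*m random workers
# and place each task on one of the m least-loaded probed workers.

import random

def batch_sample_place(queue_lens, m, d=2, rng=random):
    """Return the worker index chosen for each of m tasks."""
    probes = rng.sample(range(len(queue_lens)), min(d * m, len(queue_lens)))
    probes.sort(key=lambda w: queue_lens[w])    # shortest queues first
    placements = []
    for w in probes[:m]:
        placements.append(w)
        queue_lens[w] += 1                      # task joins that worker's queue
    return placements

random.seed(0)
queues = [random.randint(0, 10) for _ in range(100)]  # simulated queue lengths
chosen = batch_sample_place(queues, m=4, d=2)
```

Probing 2m workers per job keeps the per-job cost constant while, as the paper shows, approaching the tail latency of a fully informed centralized scheduler.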
Distributed computing in practice: the Condor experience
Provides the history and philosophy of the Condor project and describes how it has interacted with other projects and evolved along with the field of distributed computing.
Basic concepts and taxonomy of dependable and secure computing
The aim is to explicate a set of general concepts, of relevance across a wide range of situations and, therefore, helping communication and cooperation among a number of scientific and technical communities, including ones that are concentrating on particular types of system, of system failures, or of causes of systems failures.
Big data: the management revolution.
Big data, the authors write, is far more powerful than the analytics of the past. Executives can measure and therefore manage more precisely than ever before. They can make better predictions and…
The tail at scale
Software techniques that tolerate latency variability are vital to building responsive large-scale Web services.
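One concrete technique from this article, the hedged request, sends the same request to a second replica if the first has not answered within a short delay and takes whichever response arrives first. A minimal sketch, with replicas simulated by sleeps and all names and delays chosen for illustration:

```python
# Sketch of a hedged request: tolerate a straggling replica by issuing
# a duplicate request after hedge_delay and racing the two responses.

import concurrent.futures as cf
import time

def hedged_request(primary, backup, hedge_delay):
    """Call primary; if it hasn't finished within hedge_delay seconds,
    also call backup. Return whichever result arrives first."""
    with cf.ThreadPoolExecutor(max_workers=2) as pool:
        futs = [pool.submit(primary)]
        done, _ = cf.wait(futs, timeout=hedge_delay)
        if not done:                              # primary is straggling
            futs.append(pool.submit(backup))      # hedge: duplicate the request
            done, _ = cf.wait(futs, return_when=cf.FIRST_COMPLETED)
        return next(iter(done)).result()

def slow_replica():
    time.sleep(0.6)                               # simulated straggler
    return "slow replica"

def fast_replica():
    time.sleep(0.05)
    return "fast replica"

# The primary straggles past the 0.15 s hedge delay, so the backup wins.
result = hedged_request(slow_replica, fast_replica, hedge_delay=0.15)
```

In practice the hedge delay is tied to a high percentile (e.g. the 95th) of observed latency, so duplicates are issued for only a small fraction of requests.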