Chiron: Optimizing Fault Tolerance in QoS-aware Distributed Stream Processing Jobs

Authors: Morgan K. Geldenhuys, L. Thamsen, O. Kao
Published in: 2020 IEEE International Conference on Big Data (Big Data)
Fault tolerance is a property which needs deeper consideration when dealing with streaming jobs requiring high levels of availability and low-latency processing even in case of failures where Quality-of-Service constraints must be adhered to. Typically, systems achieve fault tolerance and the ability to recover automatically from partial failures by implementing Checkpoint and Rollback Recovery. However, this is an expensive operation which impacts negatively on the overall performance of the…
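
To make the Checkpoint and Rollback Recovery idea concrete, the following is a minimal sketch in Python and not the method proposed in the paper. It assumes a toy single-process word counter; the names `CheckpointedCounter`, `run_with_failure`, `checkpoint_interval`, and `fail_at` are hypothetical and chosen only for illustration. State is snapshotted every `checkpoint_interval` records, and after a simulated failure the operator restores the last snapshot and replays the input from the saved offset.

```python
import copy


class CheckpointedCounter:
    """Toy stateful operator: counts records, snapshots state periodically."""

    def __init__(self, checkpoint_interval):
        # Number of records between snapshots (hypothetical tuning knob).
        self.checkpoint_interval = checkpoint_interval
        self.state = {}
        self.offset = 0            # index of the next record to process
        self._snapshot = ({}, 0)   # last saved (state, offset) pair

    def process(self, record):
        self.state[record] = self.state.get(record, 0) + 1
        self.offset += 1
        if self.offset % self.checkpoint_interval == 0:
            self.checkpoint()

    def checkpoint(self):
        # Copying the whole state is the cost paid at every checkpoint.
        self._snapshot = (copy.deepcopy(self.state), self.offset)

    def rollback(self):
        # Restore the last snapshot and report where replay must resume.
        state, offset = self._snapshot
        self.state = copy.deepcopy(state)
        self.offset = offset
        return offset


def run_with_failure(records, checkpoint_interval, fail_at):
    """Process records, simulating one failure just before index fail_at.

    Returns the final state and the number of records that were replayed.
    """
    op = CheckpointedCounter(checkpoint_interval)
    failed = False
    reprocessed = 0
    i = 0
    while i < len(records):
        if i == fail_at and not failed:
            failed = True
            resume = op.rollback()
            reprocessed = i - resume
            i = resume
            continue
        op.process(records[i])
        i += 1
    return op.state, reprocessed


if __name__ == "__main__":
    data = ["a", "b", "a", "c", "a", "b", "c"]
    for interval in (1, 3):
        counts, redo = run_with_failure(data, interval, fail_at=5)
        print(interval, counts, redo)
```

In this sketch a shorter interval means more snapshots during normal operation but fewer records to replay after a failure, while a longer interval reverses that balance.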
