Chiron: Optimizing Fault Tolerance in QoS-aware Distributed Stream Processing Jobs

Morgan K. Geldenhuys, L. Thamsen, O. Kao
2020 IEEE International Conference on Big Data (Big Data)
Fault tolerance is a property which needs deeper consideration when dealing with streaming jobs requiring high levels of availability and low-latency processing even in case of failures where Quality-of-Service constraints must be adhered to. Typically, systems achieve fault tolerance and the ability to recover automatically from partial failures by implementing Checkpoint and Rollback Recovery. However, this is an expensive operation which impacts negatively on the overall performance of the…

A Utilization Model for Optimization of Checkpoint Intervals in Distributed Stream Processing Systems
FTI: High performance Fault Tolerance Interface for hybrid systems
Toward an Optimal Online Checkpoint Solution under a Two-Level HPC Checkpoint Model
Two-Level Incremental Checkpoint Recovery Scheme for Reducing System Total Overheads
Design, Modeling, and Evaluation of a Scalable Multi-level Checkpointing System
The design and implementation of a multi-level content-addressable checkpoint file system
Distributed Diskless Checkpoint for Large Scale Systems
Large-scale cluster management at Google with Borg
Low-overhead diskless checkpoint for hybrid computing systems
Apache Hadoop YARN: yet another resource negotiator