An Efficient Protocol for Checkpoint-Based Failure Recovery in Distributed Systems

  • Diganta Goswami, S. Sahu
Synchronous checkpointing is an attractive approach as it simplifies the process of failure recovery by storing a consistent global checkpoint. Efforts have been made to minimize the number of synchronizing messages and the number of checkpoints in such an approach. Taking the checkpoint without blocking the underlying computation is another important feature of the checkpointing process. In this paper, we present a synchronous checkpointing algorithm which forces a minimum number of nodes to take…
Comparing Distributed Online Stream Processing Systems Considering Fault Tolerance Issues
This paper analyzes four online stream processing systems (MillWheel, S4, Spark Streaming and Storm) with respect to the strategies they use for fault tolerance, and discusses the advantages and disadvantages of combining those strategies.
FESC: Functionally Equivalent Service Composition
A knowledge-based system built on a meta-reasoner tree is used to obtain a functionally equivalent service for an unavailable service; the approach is validated using a smart cooking system.


Checkpointing and Rollback-Recovery for Distributed Systems
  • R. Koo, S. Toueg
  • Computer Science
    IEEE Transactions on Software Engineering
  • 1987
This work describes a distributed algorithm to create consistent checkpoints, as well as a rollback-recovery algorithm to recover the system to a consistent state by tolerating failures that occur during their executions.
A timestamp-based checkpointing protocol for long-lived distributed computations
  • F. Cristian, F. Jahanian
  • Computer Science
    [1991] Proceedings Tenth Symposium on Reliable Distributed Systems
  • 1991
A timestamp-based protocol for checkpointing the global state of a long-lived distributed computation in an environment in which processor clocks are approximately synchronized, which avoids the domino effect by recovering to the most recent successful local checkpoint.
Concurrent robust checkpointing and recovery in distributed systems
  • P. Leu, B. Bhargava
  • Computer Science
    Proceedings. Fourth International Conference on Data Engineering
  • 1988
The algorithm is resilient to multiple process failures, handles network partitioning in a pessimistic way, and does not require that messages be received in the order in which they are sent.
Use of Common Time Base for Checkpointing and Rollback Recovery in a Distributed System
An approach to checkpointing and rollback recovery in a distributed computing system that uses a common time base and the idea of pseudo-recovery points to develop a checkpointing algorithm with the following advantages: reduced wait for commitment when establishing recovery lines, fewer messages exchanged, and lower memory requirements.
A low-overhead recovery technique using quasi-synchronous checkpointing
  • D. Manivannan, M. Singhal
  • Computer Science
    Proceedings of 16th International Conference on Distributed Computing Systems
  • 1996
A quasi-synchronous checkpointing algorithm and a low-overhead recovery algorithm based on it. The checkpointing algorithm preserves process autonomy by allowing processes to take checkpoints asynchronously, and uses communication-induced checkpoint coordination to advance the recovery line, which helps bound rollback propagation during a recovery.
On the impossibility of min-process non-blocking checkpointing and an efficient checkpointing algorithm for mobile computing systems
  • G. Cao, M. Singhal
  • Computer Science
    Proceedings. 1998 International Conference on Parallel Processing (Cat. No.98EX205)
  • 1998
It is proved that there does not exist a non-blocking algorithm that forces only a minimum number of processes to take their checkpoints, and an efficient algorithm is proposed which dramatically reduces the blocking time during the checkpointing process.
Adaptive independent checkpointing for reducing rollback propagation
  • Jian Xu, R. Netzer
  • Computer Science
    Proceedings of 1993 5th IEEE Symposium on Parallel and Distributed Processing
  • 1993
An adaptive checkpointing algorithm to practically eliminate rollback propagation for independent checkpointing is presented, based on proofs of the conditions necessary and sufficient for a checkpoint to belong to some consistent global checkpoint, previously an open question.
Rollback Recovery in Distributed Systems Using Loosely Synchronized Clocks
A rollback recovery scheme for distributed systems that will force a minimum number of nodes to roll back after failures is developed and an interprocess communication protocol which encodes state-save progress information within message frames is introduced.
Low-Cost Checkpointing and Failure Recovery in Mobile Computing Systems
A synchronous snapshot collection algorithm for mobile systems that neither forces every node to take a local snapshot, nor blocks the underlying computation during snapshot collection, and a minimal rollback/recovery algorithm in which the computation at a node is rolled back only if it depends on operations that have been undone due to the failure of node(s).
Checkpoint Space Reclamation for Uncoordinated Checkpointing in Message-Passing Systems
By using the approach of recovery line transformation and decomposition, this paper develops an optimal checkpoint space reclamation algorithm and shows that the space overhead for uncoordinated checkpointing is in fact bounded by N(N+1)/2 checkpoints, where N is the number of processes.