An Efficient Protocol for Checkpoint-Based Failure Recovery in Distributed Systems

@inproceedings{Goswami2004AnEP,
  title={An Efficient Protocol for Checkpoint-Based Failure Recovery in Distributed Systems},
  author={Diganta Goswami and S. Sahu},
  booktitle={ICDCIT},
  year={2004}
}
Synchronous checkpointing is an attractive approach as it simplifies the process of failure recovery by storing a consistent global checkpoint. Efforts have been made to minimize the number of synchronizing messages and the number of checkpoints in such an approach. Taking the checkpoint without blocking the underlying computation is another important feature of the checkpointing process. In this paper, we present a synchronous checkpointing algorithm which forces a minimum number of nodes to take…
Comparing Distributed Online Stream Processing Systems Considering Fault Tolerance Issues
TLDR
This paper analyzes four online stream processing systems (MillWheel, S4, Spark Streaming, and Storm) with respect to the strategies they use for fault tolerance, and discusses the advantages and disadvantages of combining those strategies.
FESC: Functionally Equivalent Service Composition
TLDR
A knowledge-based system built on a meta-reasoner tree obtains a functionally equivalent service to replace an unavailable one; the approach is validated using a smart cooking system.

References

Checkpointing and Rollback-Recovery for Distributed Systems
  • R. Koo, S. Toueg
  • Computer Science
    IEEE Transactions on Software Engineering
  • 1987
TLDR
This work describes a distributed algorithm to create consistent checkpoints, as well as a rollback-recovery algorithm to recover the system to a consistent state by tolerating failures that occur during their executions.
A timestamp-based checkpointing protocol for long-lived distributed computations
  • F. Cristian, F. Jahanian
  • Computer Science
    [1991] Proceedings Tenth Symposium on Reliable Distributed Systems
  • 1991
TLDR
A timestamp-based protocol for checkpointing the global state of a long-lived distributed computation in an environment in which processor clocks are approximately synchronized, which avoids the domino effect by recovering to the most recent successful local checkpoint.
Concurrent robust checkpointing and recovery in distributed systems
  • P. Leu, B. Bhargava
  • Computer Science
    Proceedings. Fourth International Conference on Data Engineering
  • 1988
TLDR
The algorithm is resilient to multiple process failures, handles network partitioning in a pessimistic way, and does not require that messages be received in the order in which they are sent.
Use of Common Time Base for Checkpointing and Rollback Recovery in a Distributed System
TLDR
An approach to checkpointing and rollback recovery in a distributed computing system using a common time base and the idea of pseudo-recovery points to develop a checkpointing algorithm that has the following advantages: reduced wait for commitment for establishing recovery lines, fewer messages to be exchanged, and less memory requirement.
A low-overhead recovery technique using quasi-synchronous checkpointing
  • D. Manivannan, M. Singhal
  • Computer Science
    Proceedings of 16th International Conference on Distributed Computing Systems
  • 1996
TLDR
A quasi-synchronous checkpointing algorithm and a low-overhead recovery algorithm based on it that preserves process autonomy by allowing them to take checkpoints asynchronously and uses communication-induced checkpoint coordination for the progression of the recovery line which helps bound rollback propagation during a recovery.
On the impossibility of min-process non-blocking checkpointing and an efficient checkpointing algorithm for mobile computing systems
  • G. Cao, M. Singhal
  • Computer Science
    Proceedings. 1998 International Conference on Parallel Processing (Cat. No.98EX205)
  • 1998
TLDR
It is proved that there does not exist a non-blocking algorithm that forces only a minimum number of processes to take their checkpoints, and an efficient algorithm is proposed which dramatically reduces the blocking time during the checkpointing process.
Adaptive independent checkpointing for reducing rollback propagation
  • Jian Xu, R. Netzer
  • Computer Science
    Proceedings of 1993 5th IEEE Symposium on Parallel and Distributed Processing
  • 1993
TLDR
An adaptive checkpointing algorithm to practically eliminate rollback propagation for independent checkpointing is presented, based on proofs of the conditions necessary and sufficient for a checkpoint to belong to some consistent global checkpoint, previously an open question.
Rollback Recovery in Distributed Systems Using Loosely Synchronized Clocks
TLDR
A rollback recovery scheme for distributed systems that will force a minimum number of nodes to roll back after failures is developed and an interprocess communication protocol which encodes state-save progress information within message frames is introduced.
Low-Cost Checkpointing and Failure Recovery in Mobile Computing Systems
TLDR
A synchronous snapshot collection algorithm for mobile systems that neither forces every node to take a local snapshot, nor blocks the underlying computation during snapshot collection, and a minimal rollback/recovery algorithm in which the computation at a node is rolled back only if it depends on operations that have been undone due to the failure of node(s).
Checkpoint Space Reclamation for Uncoordinated Checkpointing in Message-Passing Systems
TLDR
By using the approach of recovery line transformation and decomposition, this paper develops an optimal checkpoint space reclamation algorithm and shows that the space overhead for uncoordinated checkpointing is in fact bounded by N(N+1)/2 checkpoints, where N is the number of processes.