Implementing fault-tolerant services using the state machine approach: a tutorial

@article{Schneider1990ImplementingFS,
  title={Implementing fault-tolerant services using the state machine approach: a tutorial},
  author={Fred B. Schneider},
  journal={ACM Comput. Surv.},
  year={1990},
  volume={22},
  pages={299--319}
}
  • F. Schneider
  • Published 1 December 1990
  • Computer Science
  • ACM Comput. Surv.
The state machine approach is a general method for implementing fault-tolerant services in distributed systems. This paper reviews the approach and describes protocols for two different failure models: Byzantine and fail stop. System reconfiguration techniques for removing faulty components and integrating repaired components are also discussed.
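As a rough illustration of the idea in the abstract, the sketch below replicates a deterministic state machine and votes on the replicas' outputs. It is a minimal sketch, not the paper's protocols: the class and function names (`KeyValueStateMachine`, `FaultyReplica`, `replicas_needed`, `run_replicated`) are hypothetical, and the hard part, getting every correct replica to see the same requests in the same order, is assumed away by feeding the requests from a single loop. The replica counts used (2t + 1 for Byzantine failures, t + 1 for fail-stop) are the standard bounds for a t fault-tolerant state machine.

```python
from collections import Counter


class KeyValueStateMachine:
    """Deterministic state machine: the same request sequence gives the same outputs."""

    def __init__(self):
        self.store = {}

    def apply(self, request):
        op, key, value = request
        if op == "put":
            self.store[key] = value
            return "ok"
        if op == "get":
            return self.store.get(key)
        raise ValueError(f"unknown operation: {op}")


class FaultyReplica(KeyValueStateMachine):
    """Hypothetical Byzantine replica: processes requests but reports a wrong output."""

    def apply(self, request):
        super().apply(request)
        return "garbage"


def replicas_needed(t, failure_model):
    """Replicas required to tolerate t faulty replicas under each failure model."""
    if failure_model == "byzantine":
        return 2 * t + 1  # correct outputs must outvote up to t arbitrary ones
    if failure_model == "fail-stop":
        return t + 1  # any replica that answers is correct
    raise ValueError(f"unknown failure model: {failure_model}")


def run_replicated(replicas, requests, t):
    """Apply the same ordered requests to every replica and vote on each output.

    Assumption: agreement and ordering of requests are provided by the caller.
    """
    results = []
    for request in requests:
        outputs = [replica.apply(request) for replica in replicas]
        value, count = Counter(outputs).most_common(1)[0]
        if count < t + 1:
            raise RuntimeError("no output was reported by at least t + 1 replicas")
        results.append(value)
    return results


# Usage: tolerate t = 1 Byzantine failure with 2t + 1 = 3 replicas.
t = 1
replicas = [KeyValueStateMachine(), KeyValueStateMachine(), FaultyReplica()]
requests = [("put", "x", 1), ("get", "x", None)]
results = run_replicated(replicas, requests, t)
```

Because each replica is deterministic and sees the same request sequence, the two correct replicas always agree, so the single faulty output is outvoted.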
Fault-tolerant static scheduling for real-time distributed embedded systems
TLDR
A heuristic is presented for automatically producing a distributed fault-tolerant schedule of a given data-flow algorithm onto a given distributed architecture, with software redundancy of computations and time redundancy of data dependencies.
Protocols for fault-tolerant systems
  • V. Kumar
  • Computer Science
  • TENCON '91. Region 10 International Conference on EC3-Energy, Computer, Communication and Control Systems
  • 1991
TLDR
The paper describes peer communication protocols assuming the peer entity to be a group consisting of sub-entities, which resulted in simplified group communication protocols needed to synchronize and manage the group.
Reconfiguring a state machine
TLDR
This work explains several methods for reconfiguring a system implemented using the state-machine approach, including some new ones, and discusses the relation between these methods and earlier reconfiguration algorithms--especially view changing in group communication.
Towards Modeling and Model Checking Fault-Tolerant Distributed Algorithms
TLDR
Model checking state-of-the-art fault-tolerant distributed algorithms (such as Paxos) is currently out of reach except for very small systems.
Model checking fault tolerant systems
TLDR
A general framework for the formal specification and verification of fault tolerant systems is defined starting from these principles, and experience with its application to two case studies is presented.
Replicated servers for fault-tolerant real-time systems using transputers
TLDR
The design, implementation, and proof of a fault-tolerant server in a transputer network, developed in Occam with in-line GUY code at certain places to improve performance.
Distributed computing column 37: reconfiguring state machines ... and the history of common knowledge
TLDR
This work explains several methods for reconfiguring a system implemented using the state-machine approach, including some new ones, and discusses the relation between these methods and earlier reconfiguration algorithms—especially view changing in group communication.
A Formal Model for Fault-Tolerance in Distributed Systems
TLDR
A formal method based on graph rewriting systems for the specifications and the proofs of fault-tolerant distributed algorithms using correction rules in the initial graph rewriting system used to encode the distributed algorithm.
Automatic reconfiguration in the presence of failures
The paper describes a new kind of distributed system service, the availability management service, responsible for ensuring that the critical services of a distributed system remain continuously
A Perspective on the State of Research in Fault-Tolerant Systems.
TLDR
The current state of fault-tolerance research as it contributes to the dependability of computer systems is characterized, and future directions for this research area are conjectured.

References

SHOWING 1-10 OF 55 REFERENCES
Fail-stop processors: an approach to designing fault-tolerant computing systems
A methodology that facilitates the design of fault-tolerant computing systems is presented. It is based on the notion of a fail-stop processor. Such a processor automatically halts in response to any
Fault-Tolerant Broadcasts
A distributed program is presented that ensures delivery of a message to the functioning processors in a computer network, despite the fact that processors may fail at any time. All processor
Replication and fault-tolerance in the ISIS system
TLDR
Techniques for obtaining a fault-tolerant implementation from a non-distributed specification and for achieving improved performance by concurrently updating replicated data are discussed.
Using Time Instead of Timeout for Fault-Tolerant Distributed Systems.
TLDR
Description of a general method for implementing a distributed system with any desired degree of fault tolerance; a solution to the Byzantine Generals problem is assumed.
Distributed Systems: Methods and Tools for Specification, An Advanced Course, April 3-12, 1984 and April 16-25, 1985, Munich, Germany
TLDR
The Argus language and system, a graph model based approach to specifications, and issues and tools for protocol specification.
A Framework for Software Fault Tolerance in Real-Time Systems
TLDR
This work proposes a straightforward pragmatic approach to software fault tolerance which takes advantage of the structure of real-time systems to simplify error recovery, and a classification scheme for errors is introduced.
Implementing Fault-Tolerant Sensors
TLDR
This paper presents a methodology for transforming a process control program that cannot tolerate sensor failure to one that can, and a hierarchy of failure models is identified.
Byzantine clock synchronization
An informal description is given of three fault-tolerant clock-synchronization algorithms. These algorithms work in the presence of arbitrary kinds of failure, including “two-faced” clocks. Two of
Reliable communication in the presence of failures
TLDR
A review of several uses for the protocols in the ISIS system, which supports fault-tolerant resilient objects and bulletin boards, illustrates the significant simplification of higher level algorithms made possible by the approach.
Impossibility of distributed consensus with one faulty process
TLDR
In this paper, it is shown that every protocol for this problem has the possibility of nontermination, even with only one faulty process.