Implementing fault-tolerant services using the state machine approach: a tutorial

  • Fred B. Schneider
  • Published 1 December 1990
  • Computer Science
  • ACM Comput. Surv.
The state machine approach is a general method for implementing fault-tolerant services in distributed systems. This paper reviews the approach and describes protocols for two different failure models: Byzantine and fail-stop. System reconfiguration techniques for removing faulty components and integrating repaired components are also discussed.

Fault-tolerant static scheduling for real-time distributed embedded systems

A heuristic is presented for automatically producing a distributed fault-tolerant schedule of a given data-flow algorithm onto a given distributed architecture, with software redundancy of computations and time redundancy of data dependencies.

Protocols for fault-tolerant systems

  • V. Kumar
  • Computer Science
    TENCON '91. Region 10 International Conference on EC3-Energy, Computer, Communication and Control Systems
  • 1991
The paper describes peer communication protocols that treat the peer entity as a group of sub-entities, which simplifies the group communication protocols needed to synchronize and manage the group.

Reconfiguring a state machine

This work explains several methods for reconfiguring a system implemented using the state-machine approach, including some new ones, and discusses the relation between these methods and earlier reconfiguration algorithms, especially view changing in group communication.

Towards Modeling and Model Checking Fault-Tolerant Distributed Algorithms

Model checking state-of-the-art fault-tolerant distributed algorithms (such as Paxos) is currently out of reach except for very small systems.

Model checking fault tolerant systems

A general framework for the formal specification and verification of fault tolerant systems is defined starting from these principles, and experience with its application to two case studies is presented.

Distributed computing column 37: reconfiguring state machines ... and the history of common knowledge

This work explains several methods for reconfiguring a system implemented using the state-machine approach, including some new ones, and discusses the relation between these methods and earlier reconfiguration algorithms—especially view changing in group communication.

A Formal Model for Fault-Tolerance in Distributed Systems

A formal method based on graph rewriting systems is presented for the specification and proof of fault-tolerant distributed algorithms, using correction rules in the initial graph rewriting system that encodes the distributed algorithm.

A Perspective on the State of Research in Fault-Tolerant Systems.

The current state of fault-tolerance research as it contributes to the dependability of computer systems is characterized, and conjectures on future directions for this research area are offered.

Automatic reconfiguration in the presence of failures

The paper describes a new kind of distributed system service, the availability management service, responsible for ensuring that the critical services of a distributed system remain continuously



Fail-stop processors: an approach to designing fault-tolerant computing systems

A methodology that facilitates the design of fault-tolerant computing systems is presented. It is based on the notion of a fail-stop processor. Such a processor automatically halts in response to any

Fault-Tolerant Broadcasts

Replication and fault-tolerance in the ISIS system

Techniques for obtaining a fault-tolerant implementation from a non-distributed specification and for achieving improved performance by concurrently updating replicated data are discussed.

Using Time Instead of Timeout for Fault-Tolerant Distributed Systems.

A general method is described for implementing a distributed system with any desired degree of fault tolerance; a solution to the "Byzantine Generals" problem is assumed.

A Framework for Software Fault Tolerance in Real-Time Systems

This work proposes a straightforward pragmatic approach to software fault tolerance which takes advantage of the structure of real-time systems to simplify error recovery, and a classification scheme for errors is introduced.

Byzantine clock synchronization

An informal description is given of three fault-tolerant clock-synchronization algorithms. These algorithms work in the presence of arbitrary kinds of failure, including “two-faced” clocks. Two of

Reliable communication in the presence of failures

A review of several uses for the protocols in the ISIS system, which supports fault-tolerant resilient objects and bulletin boards, illustrates the significant simplification of higher level algorithms made possible by the approach.

Impossibility of distributed consensus with one faulty process

It is shown that every protocol for the consensus problem in an asynchronous system has the possibility of nontermination, even with only one faulty process.

Highly available distributed services and fault-tolerant distributed garbage collection

A fault-tolerant garbage collection method for a distributed heap that can be used in any application in which the property of interest is stable: once the property becomes true, it remains true forever.

Time, clocks, and the ordering of events in a distributed system

A distributed algorithm is given for synchronizing a system of logical clocks which can be used to totally order the events, and a bound is derived on how far out of synchrony the clocks can become.