Reliable communication in the presence of failures

Kenneth P. Birman and Thomas A. Joseph. ACM Trans. Comput. Syst.
The design and correctness of a communication facility for a distributed computer system are reported on. The facility provides support for fault-tolerant process groups in the form of a family of reliable multicast protocols that can be used in both local- and wide-area networks. These protocols attain high levels of concurrency, while respecting application-specific delivery ordering constraints, and have varying cost and performance that depend on the degree of ordering desired. In… 

A Set of Multicast Primitives for Fault Tolerant Distributed Systems

A novel algorithmic approach is introduced through which the normal processing of the messages can be performed together with the recovery actions that are required to cope with failures, and under failure conditions the algorithms perform better in terms of both network load and throughput.

Consul: a communication substrate for fault-tolerant distributed programs

This dissertation introduces Consul, a communication substrate designed to improve system dependability by providing a platform for building fault-tolerant distributed systems based on the replicated state machine approach, and shows that a semantic-based order is more efficient than a total order in many situations.

The Use of Efficient Broadcast Protocols in Asynchronous Distributed Systems

This dissertation presents techniques for deciding how strongly ordered a protocol must be to solve a given application problem, and introduces the concept of a linearization function that maps partially ordered sets of events to totally ordered histories.

Causal ordering in reliable group communications

The mechanism devised to recover from crash failures, which avoids resorting to specialized protocols, performs better than other proposals in terms of both network load and throughput, without affecting performance under reliable conditions.

Reliable broadcast for fault-tolerance on local computer networks

The authors discuss the definition and design of a generic reliable communication architecture on a widely used host-independent platform, such as a local area network (LAN), and the use of nonreplicated LANs and self-checking components.

An Asynchronous Membership Protocol that

The membership protocol presented here is integrated in the communication system, such that the notifications of membership changes are delivered to the application among the stream of regular messages in the system.

Fault-tolerant distributed systems based on broadcast communication

  • P. Melliar-Smith, L. Moser
  • Computer Science
  • [1989] Proceedings. The 9th International Conference on Distributed Computing Systems
  • 1989
An approach is presented to the design of fault-tolerant distributed systems that avoids this message exchange, resulting in systems that are substantially more efficient, based on broadcast communication over a local area network such as the Ethernet, and on two novel protocols.

Atomic Broadcast in Heterogeneous Distributed Systems

A global standard protocol that orchestrates cooperation between the different reliable broadcast protocols that run on different LANs and is capable of interacting with any reliable protocol that achieves a causal order as well as with all timestamp-based total-order protocols.

Reliable broadcasts and communication models: tradeoffs and lower bounds

The lower bound results identify a time complexity gap between systems where processors may only fail to send messages, and systems where they may fail both to send and to receive messages.

Early delivery totally ordered multicast in asynchronous environments

Experimental results show up to O(log (n)) speedup over previous protocols, which matches the authors' prediction of the expected speedup.



Replicated distributed programs

Low cost management of replicated data in fault-tolerant distributed systems

A technique is described that relaxes the usual degree of synchronization, permitting replicated data items to be updated concurrently with other operations, while at the same time ensuring that correctness is not violated, which results in better response time when performing operations on replicated data.

Replication and fault-tolerance in the ISIS system

Techniques for obtaining a fault-tolerant implementation from a non-distributed specification and for achieving improved performance by concurrently updating replicated data are discussed.

An efficient, fault-tolerant protocol for replicated data management

A data management protocol for executing transactions on a replicated database that ensures one-copy serializability and tolerates a large class of failures, including: processor and communication link crashes, partitioning of the communication network, lost messages, and slow responses of processors and communication links.

Fault-Tolerant Broadcasts

Programming with Shared Bulletin Boards in Asynchronous Distributed Systems

This paper formalizes the notion of consistent behavior when unreliable processes concurrently access a bulletin board and provides a mechanism for reasoning about consistency in distributed systems, which was previously lacking.

Time, clocks, and the ordering of events in a distributed system

A distributed algorithm is given for synchronizing a system of logical clocks which can be used to totally order the events, and a bound is derived on how far out of synchrony the clocks can become.
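
The logical-clock rule summarized above can be illustrated with a minimal sketch. It assumes the usual textbook formulation: each process increments a counter on every local or send event, and on receipt advances past the timestamp carried by the message. The names `LamportClock`, `tick`, and `receive` are hypothetical, chosen for illustration only, and the sketch covers event ordering only, not the bound on clock synchrony.

```python
from dataclasses import dataclass


@dataclass
class LamportClock:
    """Illustrative logical clock; pid breaks ties between equal times."""
    pid: int
    time: int = 0

    def tick(self):
        # Local event or send: advance the counter and return a timestamp.
        self.time += 1
        return (self.time, self.pid)

    def receive(self, sender_time):
        # Receive: move past the sender's time, counting the receive event.
        self.time = max(self.time, sender_time) + 1
        return (self.time, self.pid)


a, b = LamportClock(pid=1), LamportClock(pid=2)
e1 = a.tick()           # process 1 sends a message
e2 = b.tick()           # unrelated local event at process 2
e3 = b.receive(e1[0])   # process 2 receives the message from process 1

# Sorting (time, pid) pairs yields one total order consistent with causality.
total_order = sorted([e3, e2, e1])
```

Comparing `(time, pid)` pairs lexicographically is one way to break ties between concurrent events, so every process that sorts the same timestamps arrives at the same total order.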

Fail-stop processors: an approach to designing fault-tolerant computing systems

A methodology that facilitates the design of fault-tolerant computing systems is presented. It is based on the notion of a fail-stop processor. Such a processor automatically halts in response to any…

Concurrency Control in Distributed Database Systems

This paper describes a decomposition of the concurrency control problem into two major subproblems, read-write and write-write synchronization, and describes a series of synchronization techniques for solving each subproblem and how to combine these techniques into algorithms for solving the entire concurrency control problem.

Reliable broadcast protocols

A reliable broadcast protocol for an unreliable broadcast network is described that isolates the application programs from the unreliable characteristics of the communication network and can be used to simplify distributed database systems and distributed processing algorithms.