Reliable communication in the presence of failures
@article{Birman1987ReliableCI, title={Reliable communication in the presence of failures}, author={Kenneth P. Birman and Thomas A. Joseph}, journal={ACM Trans. Comput. Syst.}, year={1987}, volume={5}, pages={47-76} }
The design and correctness of a communication facility for a distributed computer system are reported on. The facility provides support for fault-tolerant process groups in the form of a family of reliable multicast protocols that can be used in both local- and wide-area networks. These protocols attain high levels of concurrency, while respecting application-specific delivery ordering constraints, and have varying cost and performance that depend on the degree of ordering desired. In…
1,172 Citations
A Set of Multicast Primitives for Fault Tolerant Distributed Systems
- Computer Science
- J. High Speed Networks
- 1995
A novel algorithmic approach is introduced through which the normal processing of the messages can be performed together with the recovery actions that are required to cope with failures, and under failure conditions the algorithms perform better in terms of both network load and throughput.
Consul: a communication substrate for fault-tolerant distributed programs
- Computer Science
- Distributed Syst. Eng.
- 1993
This dissertation introduces Consul, a communication substrate designed to help improve system dependability by providing a platform for building fault-tolerant, distributed systems based on the replicated state machine approach and shows that the semantic based order is more efficient than a total order in many situations.
The Use of Efficient Broadcast Protocols in Asynchronous Distributed Systems
- Computer Science, Mathematics
- 1988
This dissertation presents techniques for deciding how strongly ordered a protocol is necessary to solve a given application problem and introduces the concept of a linearization function that maps partially ordered sets of events to totally ordered histories.
Causal ordering in reliable group communications
- Computer Science
- SIGCOMM '93
- 1993
The mechanism devised to recover from crash failures, which avoids resorting to specialized protocols, performs better than other proposals in terms of both network load and throughput, without affecting performance under reliable conditions.
Reliable broadcast for fault-tolerance on local computer networks
- Computer Science
- Proceedings Ninth Symposium on Reliable Distributed Systems
- 1990
The authors discuss the definition and design of a generic reliable communication architecture on a widely used host-independent platform, such as a local area network (LAN), and the use of nonreplicated LANs and self-checking components.
An Asynchronous Membership Protocol that
- Computer Science
- 2007
The membership protocol presented here is integrated in the communication system, such that the notifications of membership changes are delivered to the application among the stream of regular messages in the system.
Fault-tolerant distributed systems based on broadcast communication
- Computer Science
- [1989] Proceedings. The 9th International Conference on Distributed Computing Systems
- 1989
An approach is presented to the design of fault-tolerant distributed systems that avoids this message exchange, resulting in systems that are substantially more efficient, based on broadcast communication over a local area network such as the Ethernet, and on two novel protocols.
Atomic Broadcast in Heterogeneous Distributed Systems
- Computer Science
- 1995
A global standard protocol is presented that orchestrates cooperation between the different reliable broadcast protocols running on different LANs and is capable of interacting with any reliable protocol that achieves a causal order, as well as with all timestamp-based total-order protocols.
Reliable broadcasts and communication models: tradeoffs and lower bounds
- Computer Science
- Distributed Computing
- 2005
The lower bound results identify a time complexity gap between systems where processors may only fail to send messages, and systems where they may fail both to send and to receive messages.
Early delivery totally ordered multicast in asynchronous environments
- Computer Science
- FTCS-23 The Twenty-Third International Symposium on Fault-Tolerant Computing
- 1993
Experimental results show up to O(log n) speedup over previous protocols, which matches the authors' prediction of the expected speedup.
References
Replicated distributed programs
- Computer Science
- SOSP '85
- 1985
Low cost management of replicated data in fault-tolerant distributed systems
- Computer Science
- TOCS
- 1986
A technique is described that relaxes the usual degree of synchronization, permitting replicated data items to be updated concurrently with other operations, while at the same time ensuring that correctness is not violated, which results in better response time when performing operations on replicated data.
Replication and fault-tolerance in the ISIS system
- Computer Science
- SOSP '85
- 1985
Techniques for obtaining a fault-tolerant implementation from a non-distributed specification and for achieving improved performance by concurrently updating replicated data are discussed.
An efficient, fault-tolerant protocol for replicated data management
- Computer Science
- PODS '85
- 1985
A data management protocol for executing transactions on a replicated database that ensures one-copy serializability and tolerates a large class of failures, including: processor and communication link crashes, partitioning of the communication network, lost messages, and slow responses of processors and communication links.
Programming with Shared Bulletin Boards in Asynchronous Distributed Systems
- Computer Science
- 1986
This paper formalizes the notion of consistent behavior when unreliable processes concurrently access a bulletin board and provides a mechanism for reasoning about consistency in distributed systems, which was previously lacking.
Time, clocks, and the ordering of events in a distributed system
- Computer Science
- CACM
- 1978
A distributed algorithm is given for synchronizing a system of logical clocks which can be used to totally order the events, and a bound is derived on how far out of synchrony the clocks can become.
Fail-stop processors: an approach to designing fault-tolerant computing systems
- Computer Science
- TOCS
- 1983
A methodology that facilitates the design of fault-tolerant computing systems is presented. It is based on the notion of a fail-stop processor. Such a processor automatically halts in response to any…
Concurrency Control in Distributed Database Systems
- Computer Science
- CSUR
- 1981
This paper describes a decomposition of the concurrency control problem into two major subproblems: read-write and write-write synchronization, and describes a series of synchronization techniques for solving each subproblem and how to combine these techniques into algorithms for solving the entire concurrency control problem.
Reliable broadcast protocols
- Computer Science, Business
- TOCS
- 1984
A reliable broadcast protocol for an unreliable broadcast network is described that isolates the application programs from the unreliable characteristics of the communication network and can be used to simplify distributed database systems and distributed processing algorithms.