Replication and fault-tolerance in the ISIS system
@inproceedings{Birman1985ReplicationAF, title={Replication and fault-tolerance in the ISIS system}, author={Kenneth P. Birman}, booktitle={Symposium on Operating Systems Principles}, year={1985} }
The ISIS system transforms abstract type specifications into fault-tolerant distributed implementations while insulating users from the mechanisms used to achieve fault-tolerance. This paper discusses techniques for obtaining a fault-tolerant implementation from a non-distributed specification and for achieving improved performance by concurrently updating replicated data. The system itself is based on a small set of communication primitives, which are interesting because they achieve high…
330 Citations
Low cost management of replicated data in fault-tolerant distributed systems
- Computer Science
- TOCS
- 1986
A technique is described that relaxes the usual degree of synchronization, permitting replicated data items to be updated concurrently with other operations, while at the same time ensuring that correctness is not violated, which results in better response time when performing operations on replicated data.
Process-replication technique for fault-tolerance and performance improvement in distributed computing systems
- Computer Science
- Proceedings of 3rd IEEE International Symposium on High Performance Distributed Computing
- 1994
A process-replication protocol is presented that aims to provide fault-tolerance as well as improved performance for applications such as long-running and real-time tasks, speeding up the determination of message sequences and the transmission of outgoing data messages at the expense of small control messages.
A Survey of Fault-Tolerance and Fault-Recovery Techniques in Parallel Systems
- Computer Science
- ArXiv
- 2005
This survey provides an overview of various fault-tolerance techniques developed to improve the robustness of supercomputing applications in the presence of failures.
Active replication in Delta-4
- Computer Science
- [1992] Digest of Papers. FTCS-22: The Twenty-Second International Symposium on Fault-Tolerant Computing
- 1992
The authors discuss the coordination of active replicas executing in either a fail-silent host computer environment or a fail-uncontrolled environment, by means of a specific protocol, the inter-replica protocol (IRp).
Configurable fault-tolerant distributed services
- Computer Science
- 1996
A new model is proposed where a service is composed out of microprotocol objects, each of which implements an individual semantic property of the overall service, making it easy to construct different customized versions of a service with properties tailored to the specifics of an application.
The Delta-4 approach to dependability in open distributed computing systems
- Computer Science
- [1988] The Eighteenth International Symposium on Fault-Tolerant Computing. Digest of Papers
- 1988
The authors present the overall Delta-4 framework for open, fault-tolerant, distributed computing systems and sketch the current implementation, which is based on a local area network with specific atomic multicasting and error-processing protocols for communicating between replicated software components.
Distributed system fault tolerance using message logging and checkpointing
- Computer Science
- 1990
A new optimistic message logging system is presented that guarantees to find the maximum possible recoverable system state, which is not ensured by previous optimistic methods.
An integrated approach to fault tolerance
- Computer Science, Business
- [1992 Proceedings] Second Workshop on the Management of Replicated Data
- 1992
Manetho, an experimental protocol system whose goal is to explore the extent to which transparent fault tolerance can be added to long-running distributed applications, is described. It uses process replication for server processes and rollback-recovery for client processes.
Visual programming of fault-tolerant distributed applications
- Computer Science
- Proceedings of Symposium on Visual Languages
- 1995
An approach is described that combines two environments (SystemSpecs and GARF) to design applications graphically using high-level Petri nets and to relieve the programmer of fault-tolerance issues.
References
Low cost management of replicated data in fault-tolerant distributed systems
- Computer Science
- TOCS
- 1986
A technique is described that relaxes the usual degree of synchronization, permitting replicated data items to be updated concurrently with other operations, while at the same time ensuring that correctness is not violated, which results in better response time when performing operations on replicated data.
Implementing Fault-Tolerant Distributed Objects
- Computer Science
- IEEE Transactions on Software Engineering
- 1985
This paper describes a technique for implementing k-resilient objects–distributed objects that remain available, and whose operations are guaranteed to progress to completion, despite up to k site…
A message system supporting fault tolerance
- Computer Science
- SOSP '83
- 1983
A simple and general design uses message-based communication to provide software tolerance of single-point hardware failures. By delivering all interprocess messages to inactive backups for both the…
Replicated distributed programs
- Computer Science
- SOSP '85
- 1985
Fail-stop processors: an approach to designing fault-tolerant computing systems
- Computer Science
- TOCS
- 1983
A methodology that facilitates the design of fault-tolerant computing systems is presented. It is based on the notion of a fail-stop processor. Such a processor automatically halts in response to any…
Robustness to Crash in a Distributed Database: A Non Shared-memory Multi-Processor Approach
- Computer Science
- VLDB
- 1984
This paper examines the inadequacy of both the traditional definition of system crash and the conventional approaches to crash recovery for this architecture and describes an approach to recovery from failures which takes advantage of the multiple independent processor memories and avoids system restart in many cases.
A NonStop kernel
- Computer Science
- SOSP
- 1981
Using these primitives, a mechanism that allows fault-tolerant resource access, the process-pair, is described, and some observations are made on this type of system structure and on actual use of the system.
Time, clocks, and the ordering of events in a distributed system
- Computer Science
- CACM
- 1978
A distributed algorithm is given for synchronizing a system of logical clocks which can be used to totally order the events, and a bound is derived on how far out of synchrony the clocks can become.
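The logical-clock idea summarized above can be sketched in a few lines. This is a minimal illustration of the standard Lamport clock rules (increment on every event, take the maximum on receipt, break ties by process id), not code from the paper; the `Process` class and its method names are hypothetical, and the bound on physical clock drift that the summary mentions is not modeled here.

```python
from dataclasses import dataclass


@dataclass
class Process:
    """One process carrying a Lamport logical clock (illustrative names)."""
    pid: int
    clock: int = 0

    def local_event(self) -> tuple[int, int]:
        # Rule 1: tick the clock before every local event.
        self.clock += 1
        return (self.clock, self.pid)

    def send(self) -> int:
        # Sending is itself an event; the message carries its timestamp.
        self.clock += 1
        return self.clock

    def receive(self, msg_time: int) -> tuple[int, int]:
        # Rule 2: jump past the sender's timestamp, then tick.
        self.clock = max(self.clock, msg_time) + 1
        return (self.clock, self.pid)


p, q = Process(pid=1), Process(pid=2)
a = p.local_event()   # event on p
t = p.send()          # p sends a message stamped t
b = q.receive(t)      # q receives it, so b is ordered after the send
c = q.local_event()   # later event on q
d = p.local_event()   # concurrent with b; the pid breaks the tie

# Sorting (timestamp, pid) pairs yields one total order of all events
# that is consistent with the happened-before relation.
order = sorted([a, b, c, d])
print(order)
```

Pairing each timestamp with the process id is what turns the partial order into a total one: two events with equal clock values are ordered by an arbitrary but fixed ranking of the processes.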
LOCUS: a network transparent, high reliability distributed system
- Computer Science
- SOSP
- 1981
LOCUS is a distributed operating system that provides a very high degree of network transparency while at the same time supporting high performance and automatic replication of storage. Atomic file operations and extensive synchronization are supported.
Determining the last process to fail
- Computer Science
- TOCS
- 1985
Necessary and sufficient conditions are derived here for computing LAST from the local failure data of recovered processes, and these conditions are then translated into procedures for deciding LAST membership, using either complete or incomplete failure data.