Replication and fault-tolerance in the ISIS system

@inproceedings{Birman1985ReplicationAF,
  title={Replication and fault-tolerance in the ISIS system},
  author={Kenneth P. Birman},
  booktitle={Symposium on Operating Systems Principles},
  year={1985}
}
  • K. Birman
  • Published in
    Symposium on Operating…
    1 December 1985
  • Computer Science
The ISIS system transforms abstract type specifications into fault-tolerant distributed implementations while insulating users from the mechanisms used to achieve fault-tolerance. This paper discusses techniques for obtaining a fault-tolerant implementation from a non-distributed specification and for achieving improved performance by concurrently updating replicated data. The system itself is based on a small set of communication primitives, which are interesting because they achieve high… 

A fault-tolerant server on MACH

Low cost management of replicated data in fault-tolerant distributed systems

A technique is described that relaxes the usual degree of synchronization, permitting replicated data items to be updated concurrently with other operations, while at the same time ensuring that correctness is not violated, which results in better response time when performing operations on replicated data.

Process-replication technique for fault-tolerance and performance improvement in distributed computing systems

  • Jane-Ferng Chiu, Ge-Ming Chiu
  • Computer Science
    Proceedings of 3rd IEEE International Symposium on High Performance Distributed Computing
  • 1994
A process-replication protocol is described that aims to provide fault tolerance and improved performance for applications such as long-running and real-time tasks by speeding up the determination of message sequences and the transmission of outgoing data messages, at the expense of small control messages.

A Survey of Fault-Tolerance and Fault-Recovery Techniques in Parallel Systems

This survey provides an overview of various fault-tolerance techniques developed to improve the robustness of supercomputing applications in the presence of failures.

Active replication in Delta-4

The authors discuss the coordination of active replicas executing in either a fail-silent host computer environment or a fail-uncontrolled environment by means of a specific protocol, the inter-replica protocol (IRp).

Configurable fault-tolerant distributed services

A new model is proposed where a service is composed out of microprotocol objects, each of which implements an individual semantic property of the overall service, making it easy to construct different customized versions of a service with properties tailored to the specifics of an application.

The Delta-4 approach to dependability in open distributed computing systems

The authors present the overall Delta-4 framework for open, fault-tolerant, distributed computing systems and sketch the current implementation, which is based on a local area network with specific atomic multicasting and error-processing protocols for communicating between replicated software components.

Distributed system fault tolerance using message logging and checkpointing

A new optimistic message logging system is presented that guarantees to find the maximum possible recoverable system state, which is not ensured by previous optimistic methods.

An integrated approach to fault tolerance

  • E. Elnozahy, W. Zwaenepoel
  • Computer Science, Business
    [1992 Proceedings] Second Workshop on the Management of Replicated Data
  • 1992
Manetho, an experimental protocol system whose goal is to explore the extent to which transparent fault tolerance can be added to long-running distributed applications, is described. It uses process replication for server processes and rollback-recovery for client processes.

Visual programming of fault-tolerant distributed applications

An approach is described that combines two environments (SystemSpecs and GARF) to design applications graphically using high-level Petri nets and to relieve the programmer of fault-tolerance issues.

References

Low cost management of replicated data in fault-tolerant distributed systems

A technique is described that relaxes the usual degree of synchronization, permitting replicated data items to be updated concurrently with other operations, while at the same time ensuring that correctness is not violated, which results in better response time when performing operations on replicated data.

Implementing Fault-Tolerant Distributed Objects

This paper describes a technique for implementing k-resilient objects–distributed objects that remain available, and whose operations are guaranteed to progress to completion, despite up to k site

A message system supporting fault tolerance

A simple and general design uses message-based communication to provide software tolerance of single-point hardware failures. By delivering all interprocess messages to inactive backups for both the

Replicated distributed programs

Fail-stop processors: an approach to designing fault-tolerant computing systems

A methodology that facilitates the design of fault-tolerant computing systems is presented. It is based on the notion of a fail-stop processor. Such a processor automatically halts in response to any

Robustness to Crash in a Distributed Database: A Non Shared-memory Multi-Processor Approach

This paper examines the inadequacy of both the traditional definition of system crash and the conventional approaches to crash recovery for this architecture and describes an approach to recovery from failures which takes advantage of the multiple independent processor memories and avoids system restart in many cases.

A NonStop kernel

Using these primitives, a mechanism that allows fault-tolerant resource access, the process-pair, is described, and some observations are made on this type of system structure and on actual use of the system.

Time, clocks, and the ordering of events in a distributed system

A distributed algorithm is given for synchronizing a system of logical clocks which can be used to totally order the events, and a bound is derived on how far out of synchrony the clocks can become.

LOCUS: a network transparent, high reliability distributed system

LOCUS is a distributed operating system that provides a very high degree of network transparency while at the same time supporting high performance and automatic replication of storage. Atomic file operations and extensive synchronization are supported.

Determining the last process to fail

Necessary and sufficient conditions are derived here for computing LAST from the local failure data of recovered processes, and these conditions are then translated into procedures for deciding LAST membership, using either complete or incomplete failure data.