The process group approach to reliable distributed computing

  • Kenneth P. Birman
  • Published 1 September 1992
  • Computer Science
  • Commun. ACM
Abstract: The difficulty of developing reliable distributed software is an impediment to applying distributed computing technology in many settings. Experience with the ISIS system suggests that a structured approach based on virtually synchronous process groups yields systems that are substantially easier to develop, exploit sophisticated forms of cooperative computation, and achieve high reliability. This paper reviews six years of work on ISIS, describing the model, its implementation challenges… 


On group communication in large-scale distributed systems

The process group mechanism is considered an appropriate application-structuring paradigm for such large-scale distributed systems, and a formal characterization of the attribute "large scale" as applied to distributed systems is given.

A distributed approach to the design of applications

  • D. Johansen
  • Computer Science
    Proceedings of ICCI'93: 5th International Conference on Computing and Information
  • 1993
This work suggests a general design for a certain class of distributed applications that monitor real-world events, such as weather and pollution parameters in StormCast.

Construction and management of highly available services in open distributed systems

A novel replication protocol is presented that satisfies two fundamental requirements of this environment: first, it hides replication from the service clients, and second, it facilitates the dynamic reconfiguration of the server group.

System Support for Distributed Computing

An experimental toolkit that supports object-oriented programming of distributed, failure-resilient applications is presented, and it is compared with a compute-intensive application implemented on Electra, PVM, a transputer, and two different Linda systems.

Distributed object replication in a cluster of workstations

  • Wanlei Zhou, L. Wang
  • Computer Science
    Proceedings Fourth International Conference/Exhibition on High Performance Computing in the Asia-Pacific Region
  • 2000
The work focuses on an object-oriented design pattern that simplifies the design and implementation of distributed replication for reliable and efficient services on a cluster of workstations.

Group-based distributed computing: Programming and distributed platform model

This chapter describes the subject of the thesis, beginning with a discussion of the background and motivation that led to the research and the material presented in the thesis. The focus of the…

An architecture for building reliable distributed object-based systems

  • Li Wang, Wanlei Zhou
  • Computer Science, Engineering
    Proceedings. Technology of Object-Oriented Languages. TOOLS 24 (Cat. No.97TB100240)
  • 1997
The proposed architecture attempts to unify advances in client-server, remote procedure call, reliable group communication, and object-oriented technologies within a single framework, easing application developers' work of building reliable distributed systems.

Requirements for high performance group support in distributed systems

The authors identify areas in which controversy still exists about how the group paradigm should actually be implemented, and they give their views on the disputed issues.

An Agreement Service for Implementing Fault Tolerant Distributed Software

An agreement service built on top of a group communication layer is proposed that allows distributed applications to reach agreement; it facilitates the development of fault-tolerant distributed software (FTDS) by implementing agreement protocols transparently to the application programmer.

The group approach in cooperative work and in load balancing

The value of partitioning into groups is measured for two techniques used in distributed systems: cooperative work (a user application) and load balancing (a system application).



Exploiting virtual synchrony in distributed systems

It is argued that the virtual synchrony approach to building distributed and fault-tolerant software is more straightforward, more flexible, and more likely to yield correct solutions than alternative approaches.

Using process groups to implement failure detection in asynchronous environments

A rigorous, formal specification for group membership is presented under this interpretation and a solution is presented for this problem as it relates to failure detection in asynchronous, distributed systems.

The ISIS project: real experience with a fault tolerant programming system

The ISIS project has developed a distributed programming toolkit [2,3] and a collection of higher-level applications based on these tools. ISIS is now in use at more than 300 locations worldwide.

The Use of Efficient Broadcast Protocols in Asynchronous Distributed Systems

This dissertation presents techniques for deciding how strongly ordered a protocol must be to solve a given application problem, and introduces the concept of a linearization function that maps partially ordered sets of events to totally ordered histories.

An efficient reliable broadcast protocol

This paper presents a (software) protocol that simulates reliable broadcast, even on an unreliable network, and is more efficient than previously published reliable broadcast protocols.

The many faces of consensus in distributed systems

The goal is to give practitioners some sense of the system hardware and software guarantees that are required to achieve a given level of reliability and performance.

Low cost management of replicated data in fault-tolerant distributed systems

A technique is described that relaxes the usual degree of synchronization, permitting replicated data items to be updated concurrently with other operations while still ensuring that correctness is not violated; this results in better response times for operations on replicated data.

Replication and fault-tolerance in the ISIS system

Techniques for obtaining a fault-tolerant implementation from a non-distributed specification and for achieving improved performance by concurrently updating replicated data are discussed.

Designing application software in wide area network settings

This work considers the design of application software spanning multiple local area environments and presents simple design techniques that yield fault-tolerant wide area programs for important classes of applications.

Tools for distributed application management

It is shown how NuMon, a seismological analysis system for monitoring compliance with nuclear test-ban treaties, is managed within the Meta framework, a system that solves some longstanding problems of distributed applications.