A probabilistic method is proposed for reading remote clocks in distributed systems subject to unbounded random communication delays. The method can achieve clock synchronization precisions superior to those attainable by previously published clock synchronization algorithms. Its use is illustrated by presenting a time service which maintains externally ...
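The remote clock reading idea can be illustrated with a minimal sketch. It is not taken from the paper: it assumes the usual round-trip formulation (send a request, time the round trip on the local clock, accept the reply only if the round trip is short enough, otherwise retry) and it ignores clock drift. The names `read_remote_clock`, `max_round_trip`, `min_delay`, and the `SimulatedLink` harness are hypothetical.

```python
def read_remote_clock(local_clock, request_remote_time, max_round_trip,
                      min_delay=0.0, max_attempts=10):
    """Try to read a remote clock; return (estimate, max_error) or None.

    local_clock:         callable returning the local clock value
    request_remote_time: callable that performs one request/reply exchange
                         and returns the remote timestamp in the reply
    max_round_trip:      replies slower than this are discarded (assumed knob)
    min_delay:           assumed lower bound on one-way message delay
    """
    for _ in range(max_attempts):
        sent = local_clock()
        remote_stamp = request_remote_time()
        received = local_clock()
        round_trip = received - sent
        if round_trip <= max_round_trip:
            # Ignoring drift, the remote clock at reply arrival lies in
            # [stamp + min_delay, stamp + round_trip - min_delay].
            # Take the midpoint; the error is half the interval width.
            estimate = remote_stamp + round_trip / 2.0
            max_error = round_trip / 2.0 - min_delay
            return estimate, max_error
    # Every attempt was too slow: the reading fails, which is why the
    # approach is probabilistic rather than guaranteed.
    return None


class SimulatedLink:
    """Hypothetical harness: one timeline, remote clock at a fixed offset."""

    def __init__(self, offset, delays):
        self.now = 0.0
        self.offset = offset
        self.delays = list(delays)  # (outbound, inbound) delay pairs

    def local_clock(self):
        return self.now

    def remote_clock(self):
        return self.now + self.offset

    def request_remote_time(self):
        outbound, inbound = self.delays.pop(0)
        self.now += outbound          # request travels to the remote side
        stamp = self.remote_clock()   # remote side timestamps its reply
        self.now += inbound           # reply travels back
        return stamp
```

In this sketch a smaller `max_round_trip` yields a tighter error bound but makes it more likely that several attempts are needed before a reading succeeds.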
We propose a small number of basic concepts that can be used to explain the architecture of fault-tolerant distributed systems and we discuss a list of architectural issues that we find useful to consider when designing or examining such systems. For each issue we present known solutions and design alternatives, we discuss their relative merits and we give ...
In loosely coupled distributed systems subject to random communication delays and component failures, atomic broadcast protocols can be used to implement the abstraction of a Δ-common storage, a replicated storage that displays at any clock time the same contents to every correct processor and that requires Δ time units to complete replicated updates. We ...
The first part of this paper provides rigorous definitions for several basic concepts underlying the design of dependable programs, such as specification, program semantics, exception, program correctness, robustness, failure, fault, and error. The second part investigates what it means to handle exceptions in modular programs structured as hierarchies of data ...
Reaching agreement on the identity of correctly functioning processors of a distributed system in the presence of random communication delays, failures and processor joins is a fundamental problem in fault-tolerant distributed systems. Assuming a synchronous communication network that is not subject to partition occurrences, we specify the processor-group ...
We introduce the timed asynchronous distributed system model to describe existing asynchronous distributed systems subject to unbounded processing and communication delays, failures and recoveries. We then describe five increasingly strong specifications for processor-group membership services in timed asynchronous systems subject to partitioning. We also ...
bounded responses with a certain probability. This article emphasized similarities between synchronous and asynchronous programming by discussing only strict agreement, the kind of asynchronous agreement closest to synchronous agreement. In reality, the field of asynchronous group communication is vaster, with strict agreement being one extreme where all replicas ...
We present DATUM, a novel method for tolerating multiple disk failures in disk arrays. DATUM is the first known method that can mask any given number of failures, requires an optimal amount of redundant storage space, and spreads reconstruction accesses uniformly over disks in the presence of failures without needing large layout tables in ...
Atomic broadcast ensures that concurrent updates to the state of a process group are consistently delivered to all group members despite random communication delays and failures. By relieving replicated application programmers from the burden of dealing with the difficult issue of maintaining replica state consistency, atomic broadcast is a fundamental ...
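The ordering property described here can be illustrated with a small sketch. It is not the protocol from the paper: it assumes a single, failure-free sequencer that numbers updates, and it shows only why replicas that apply updates in one agreed order end up with the same state even when the network reorders messages. The `Sequencer` and `Replica` classes are hypothetical.

```python
class Sequencer:
    """Assigns a global sequence number to each update (assumed fault-free)."""

    def __init__(self):
        self.next_seq = 0

    def stamp(self, update):
        seq = self.next_seq
        self.next_seq += 1
        return seq, update


class Replica:
    """Applies updates strictly in sequence-number order."""

    def __init__(self):
        self.state = []          # log of applied updates
        self.next_expected = 0   # next sequence number to apply
        self.pending = {}        # updates that arrived early

    def receive(self, seq, update):
        # Buffer the update, then apply every update that is now in order.
        self.pending[seq] = update
        while self.next_expected in self.pending:
            self.state.append(self.pending.pop(self.next_expected))
            self.next_expected += 1


sequencer = Sequencer()
messages = [sequencer.stamp(u) for u in ("set x=1", "set y=2", "set x=3")]

replica_a, replica_b = Replica(), Replica()
for seq, update in messages:              # arrives in order at replica A
    replica_a.receive(seq, update)
for seq, update in reversed(messages):    # arrives reversed at replica B
    replica_b.receive(seq, update)
```

A real atomic broadcast protocol must also tolerate sequencer and member failures and bound delivery time, which this sketch deliberately leaves out.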