Flaviu Cristian is a computer scientist at the IBM Almaden Research Center. After carrying out research in operating systems and programming methodology in France, and working on the specification, design, and verification of fault-tolerant programs in England, he joined IBM in 1982. Since then he has worked in the area of fault-tolerant distributed protocols and systems. He has…
We propose a small number of basic concepts that can be used to explain the architecture of fault-tolerant distributed systems, and we discuss a list of architectural issues that we find useful to consider when designing or examining such systems. For each issue we present known solutions and design alternatives, we discuss their relative merits, and we give…
The first part of this paper provides rigorous definitions for several basic concepts underlying the design of dependable programs, such as specification, program semantics, exception, program correctness, robustness, failure, fault, and error. The second part investigates what it means to handle exceptions in modular programs structured as hierarchies of data…
In loosely coupled distributed systems subject to random communication delays and component failures, atomic broadcast protocols can be used to implement the abstraction of a Δ-common storage, a replicated storage that displays at any clock time the same contents to every correct processor and that requires Δ time units to complete replicated updates. We…
We introduce the timed asynchronous distributed system model to describe existing asynchronous distributed systems subject to unbounded processing and communication delays, failures, and recoveries. We then describe five increasingly strong specifications for processor-group membership services in timed asynchronous systems subject to partitioning. We also…
We present DATUM, a novel method for tolerating multiple disk failures in disk arrays. DATUM is the first known method that can mask any given number of failures, requires an optimal amount of redundant storage space, and spreads reconstruction accesses uniformly over disks in the presence of failures without needing large layout tables in…
Atomic broadcast ensures that concurrent updates to the state of a process group are consistently delivered to all group members despite random communication delays and failures. By relieving replicated application programmers from the burden of dealing with the difficult issue of maintaining replica state consistency, atomic broadcast is a fundamental…
Fortress is a support system for designing and implementing fault-tolerant distributed real-time systems that use commercial off-the-shelf (COTS) components. The main problem we address in Fortress is that services cannot always provide their standard properties due to the possibility of missed deadlines, dropped messages, and process crashes. Fortress allows…
Reaching agreement on the identity of correctly functioning processors of a distributed system in the presence of random communication delays, failures, and processor joins is a fundamental problem in fault-tolerant distributed systems. Assuming a synchronous communication network that is not subject to partition occurrences, we specify the processor-group…