Lazy Repair for Addition of Fault-Tolerance to Distributed Programs
Late detection of new types of faults often results in the evolution of faulttolerance requirements while developers have already created design artifacts. Thus, the reuse of an existing design in the development of a fault-tolerant version thereof has the potential to reduce the overall development costs. Moreover, the automation of such a reuse yields a fault-tolerant design that is correct by construction, given that the existing design is correct. To facilitate such an automation, we present an approach, where we add three levels of faulttolerance, namely failsafe, nonmasking, and masking, to functional designs represented as state machines. Intuitively, failsafe fault-tolerance requires that safety specification is met even in the presence of faults. In the presence of faults, nonmasking fault-tolerance guarantees recovery to states from where safety and liveness specifications are satisfied. Masking fault-tolerance stipulates that (i) recovery is provided to states from where safety and liveness specifications are met, and (ii) safety specification is satisfied during such a recovery. Specifically, we present sound and complete deterministic algorithms for automated addition of (failsafe/nonmasking/masking) fault-tolerance to the functional design of concurrent programs. These polynomial-time algorithms are especially useful in model-driven development of fault-tolerant systems, where models are automatically checked and modified. We also discuss (1) the effect of distribution and safety specification model on the complexity of adding fault-tolerance, and (2) the impact of the proposed algorithms on the addition of multitolerance.