Reliability Issues in Computing System Design

@article{Randell1978ReliabilityII,
  title={Reliability Issues in Computing System Design},
  author={Brian Randell and P. A. Lee and Philip C. Treleaven},
  journal={ACM Comput. Surv.},
  year={1978},
  volume={10},
  pages={123--165}
}
This paper surveys the various problems involved in achieving very high reliability from complex computing systems, and discusses the relationship between system structuring techniques and techniques of fault tolerance. Topics covered include: 1) protective redundancy in hardware and software; 2) the use of atomic actions to structure the activity of a system to limit information flow; 3) error detection techniques; 4) strategies for locating and dealing with faults and for assessing the damage they… 
Software reliability in real-time systems
TLDR
This paper investigates techniques to enhance the continuity of operations of the en route air traffic control system by studying issues of software reliability and fault tolerance in real-time systems, along with four architectures of the recovery block scheme.
Design of Multilevel Fault Tolerant Systems
TLDR
The aim of this paper is to provide a structured methodology and several implementation suggestions for the design of failure tolerant systems that does not explicitly differentiate between hard and soft objects and allows the treatment of failure tolerance from the first design phases.
High-availability computer systems
TLDR
The techniques used to build highly available computer systems are sketched, and the use of pairs of computer systems at separate locations to guard against unscheduled outages due to outside sources (communication or power failures, earthquakes, etc.) is addressed.
Some Critical Comments on the Paper "An Optimal Approach to Fault Tolerant Software Systems Design" by Gannon and Shapiro
TLDR
Fault tolerance in software systems is becoming increasingly important for the attainment of high reliability in computing systems, and the authors of these comments were somewhat concerned about some of the concepts expressed in the above paper, mainly in its first two sections, which they feel give a misleading impression of certain aspects of fault tolerance.
The design description language of MARPLE: reliability analysis during system design
  • M. Mulazzani
  • Computer Science
    Proceedings of the Twenty-Fourth Annual Hawaii International Conference on System Sciences
  • 1991
TLDR
This paper concentrates on MARPLE's interface with design systems, i.e. the design description language DDL, and on the reliability aspects relevant during the design process.
DEPENDABLE COMPUTING CONCEPTS AND FAULT TOLERANCE: TERMINOLOGY
This paper provides a conceptual framework for expressing the attributes of what constitutes dependable and reliable computing: the impairments to dependability (faults, errors, and failures), the means
Fault-tolerant computing concepts for aerospace applications—a survey
  • A. Pedar, V. Sarma
  • Computer Science
    Proceedings of the Indian Academy of Sciences Section C: Engineering Sciences
  • 1980
TLDR
The new concept of performability, which combines both the performance and the reliability of a system, and the configuration optimisation of a gracefully degradable computing system are discussed.
Model-Based Analysis and Development of Dependable Systems
TLDR
This chapter concentrates on safety and reliability aspects and starts with a review of the basic terminology including, for example, fault, failure, availability, and integrity.
A Framework for Software Fault Tolerance in Real-Time Systems
TLDR
This work proposes a straightforward pragmatic approach to software fault tolerance which takes advantage of the structure of real-time systems to simplify error recovery, and a classification scheme for errors is introduced.
The simulation of a fault tolerant computer system
TLDR
The properties of a fault tolerant computer system based on a hexagonal grid of processing elements (called the FMPA system) are investigated through discrete event simulation; the system is remarkably robust, and even seems to perform better in the face of moderate component failure.
...

References

SHOWING 1-10 OF 100 REFERENCES
Reliable Computing Systems
  • B. Randell
  • Computer Science
    Advanced Course: Operating Systems
  • 1978
TLDR
An analysis of the various problems involved in achieving very high reliability from complex computing systems is presented, and the relationship between system structuring techniques and techniques of fault tolerance is discussed.
The STAR (Self-Testing And Repairing) Computer: An Investigation of the Theory and Practice of Fault-Tolerant Computer Design
TLDR
The following aspects of the STAR system are described: architecture, reliability analysis, software, automatic maintenance of peripheral systems, and adaptation to serve as the central computer of an outer-planet exploration spacecraft.
Reliable hardware-software architecture
  • W. Wulf
  • Computer Science
    Reliable Software
  • 1975
TLDR
The author's design philosophy aims at keeping the system operational even though the underlying hardware may be malfunctioning, which is essentially an extension of the 'modular' programming methodology, advocated by Parnas and others, to include dynamic error detection and recovery.
Fault-tolerance experiments with the JPL STAR computer.
Results of fault-tolerance experiments performed using an experimental computer with dynamic (standby) redundancy, including replaceable subsystems and a 'program rollback' provision to eliminate
Recovery blocks in action: A system supporting high reliability
TLDR
A brief account is presented of the recovery block scheme, together with a description of a new implementation of the underlying cache mechanism, which incorporates this implementation and also provides a high level of detection for errors such as the corruption of code and data.
Reliable hardware/software architecture
  • W. Wulf
  • Computer Science
    IEEE Transactions on Software Engineering
  • 1975
TLDR
The paper focuses on the design philosophy which aims at keeping the system operational even though the underlying hardware may be malfunctioning, which is essentially an extension of the `modular' programming methodology to include dynamic error detection and recovery.
System structure for software fault tolerance
  • B. Randell
  • Computer Science
    IEEE Transactions on Software Engineering
  • 1975
TLDR
The aim is to facilitate the provision of dependable error detection and recovery facilities which can cope with errors caused by residual design inadequacies, particularly in the system software, rather than merely the occasional malfunctioning of hardware components.
Fault Tolerant Operating Systems
TLDR
This paper develops four related architectural principles which can guide the construction of error-tolerant operating systems, and implementations of these principles are given for process management, interrupts and traps, store access through capabilities, protected procedure entry, and tagged architecture.
Software reliability: The role of programmed exception handling
TLDR
It is shown, using an example program, how exception handling can be combined with the recovery block structure to improve the effectiveness with which problems due to anticipated faulty input data, hardware components, etc., are dealt with, while continuing to provide means for recovering from unanticipated faults.
A Study of Fault-Tolerant Computing
TLDR
The report presents the results of a study of fault-tolerant computing, evaluating existing and new architectural techniques for use in cost-effective systems attaining desired measures of correctness, availability and recovery.
...