Designing for Ultrahigh Availability: The Unix RTR Operating System

@article{Wallace1984DesigningFU,
  title={Designing for Ultrahigh Availability: The Unix RTR Operating System},
  author={John J. Wallace and Walter W. Barnes},
  journal={Computer},
  year={1984},
  volume={17},
  pages={31--39}
}
Early designers of highly available computers concentrated on recovery from hardware failures to keep downtime to a minimum. But as hardware became more reliable, and systems (particularly software) became more complex, the percentage of downtime caused by hardware decreased. Achieving ultrahigh availability, on the order of a few minutes of downtime per year, requires far more than just reliable hardware. This can be seen from Table 1, which gives causes of downtime for both electronic switching… 

X-Ware Reliability and Availability Modeling
TLDR
It is shown that there is no theoretical impediment to deriving classical reliability models, and that the classical reliability theory can be generalized in order to cover both hardware and software viewpoints that are X-Ware.
Architecture of fault-tolerant computers: an historical perspective
The author surveys the approaches and techniques used to improve system reliability. Over the past 40 years computing systems have experienced over three orders of magnitude improvement in average
Niche successes to ubiquitous invisibility: fault-tolerant computing past, present, and future
TLDR
The rise of fault tolerance from the early dedicated service providers through today's commodity markets is traced, and barriers to the use of fault tolerance in commodity markets are identified.
Transaction processing monitors
TLDR
This article describes one efficient method for estimating gradients in the Monte Carlo setting, namely the likelihood ratio method (also known as the efficient score method), and derives likelihood-ratio-gradient estimators for both time-homogeneous and non-time homogeneous discrete-time Markov chains.
Fault-Tolerant Computing
Software reliability analysis of three successive generations of a telecommunications system
  • M. Kaâniche, K. Kanoun
  • Computer Science
    Proceedings. 1998 IEEE Workshop on Application-Specific Software Engineering and Technology. ASSET-98 (Cat. No.98EX183)
  • 1998
TLDR
Analysis of the data collected on the software of three successive generations of the Brazilian switching system TROPICO-R during validation and operation addresses the modifications introduced on system components, the distribution of failures and corrected faults in the components, and the functions fulfilled by the system.
Software Failure Data Analysis of two Successive Generations of a Switching System
TLDR
This paper analyzes the failure data of two successive products of a software switching system during validation and operation to determine the evolution of the failure intensity functions.
Implementation of a module implementor for an activity based distributed system
TLDR
To investigate the feasibility of this environment, a fully functional module implementor was created and other components with limited functionality were coded and an assembly line was simulated within the context of these components.
An annotated bibliography of dependable distributed computing
TLDR
This report was prepared as part of a Summer Faculty Research Program associateship sponsored by Rome Laboratory of the U.S. Air Force Systems Command.

References

1A Processor: Maintenance Software
TLDR
Results of extensive laboratory testing and early field experience indicate that the maintenance objectives will be achieved despite the size and complexity of the 1A Processor.
Fault-tolerant design of local ESS processors
  • W. N. Toy
  • Computer Science
    Proceedings of the IEEE
  • 1978
TLDR
Pertinent processor architecture features used to achieve ESS reliability objectives are discussed, and a detailed discussion of the maintenance design of the 3A Processor is also included.
Software reliability guidebook
Maintenance Software
  • Bell System Technical J
  • 1977
The 3B20D Processor and DMERT Operating System
  • Bell System Technical J
  • 1983
3B20D Computer System Field Experience-The First 250
Attaining Objectives in a High-Availability System
  • Third Annual Int'l Phoenix Conf. Computers and Communications
  • 1984
DMERT: A Fault-Tolerant Environment for Diverse Applications
  • Proc. 14th Int'l Conf. Fault-Tolerant Computing, June
  • 1984
Delessio, "3B20D Computer System Field Experience-The
  • Proc. Eleventh Int'l Switching Symp.,
  • 1984