A census of Tandem system availability between 1985 and 1990

  • J. Gray
  • Published 1 October 1990
  • Computer Science
  • IEEE Transactions on Reliability
A census of customer outages reported to Tandem has been taken. It shows a clear improvement in the reliability of hardware and maintenance, and indicates that software is now the major source of reported outages (62%), followed by system operations (15%). This is a dramatic shift from the statistics for 1985. Even after discounting systematic underreporting of operations and environmental outages, the conclusion is clear: hardware faults and hardware maintenance are no longer a major source of…


Analysis of software halts in the tandem GUARDIAN operating system

  • Inhwan Lee, R. Iyer
  • Computer Science
    [1992] Proceedings Third International Symposium on Software Reliability Engineering
  • 1992
The results show that occurrences of software halts are not correlated with each other in time, and that fault tolerance in the measured system reduced the service loss by nearly 90%.

An analysis of client/server outage data

  • A. Wood
  • Computer Science
    Proceedings of 1995 IEEE International Computer Performance and Dependability Symposium
  • 1995
This paper examines client/server outage data and presents a list of outage causes extracted from the data, including hardware, software, operations, and environmental failures, as well as outages due to planned reconfigurations. The causes are used to predict availability in a typical client/server environment and to evaluate various fault-tolerant architectures.

High-availability computer systems

The techniques used to build highly available computer systems are sketched, and the use of pairs of computer systems at separate locations to guard against unscheduled outages due to outside sources (communication or power failures, earthquakes, etc.) is addressed.

Analysis of Preventive Maintenance in Transactions Based Software Systems

An analytical model of a software system that serves transactions is presented, and expressions are derived for the resulting steady-state availability, the probability that an arriving transaction is lost, and an upper bound on the expected response time of a transaction.

Dependability and Performance Measures for the Database Practitioner

We estimate the availability, reliability, and mean transaction time (response time) for repairable database configurations, centralized or distributed, in which each service component is

A study of the reliability of Internet sites

By applying an appropriate test statistic, some samples were found to have a realistic chance of being drawn from an exponential distribution, while others can be confidently classed as nonexponential.

Measurement and Analysis of Failures in Computer Systems

A study of software failures spanning several different releases of Tandem's NonStop-UX operating system running on Tandem Integrity S2(TMR) systems, focusing primarily on those TPRs that report a UNIX panic that subsequently crashes the system.

System Support for Software Fault Tolerance in Highly Available Database Management Systems

The dissertation describes modifications to the storage system that improve its performance in environments with high update rates and adds to the fast recovery capabilities of POSTGRES with two techniques for maintaining B-tree index consistency without log processing.

Application of Stochastic Analysis Networks for Space Vehicle Hardware and Software Reliability and Availability Predictions

This paper describes the methodology and results of an integrated hardware and software reliability and availability model of an experimental satellite. The satellite computing architecture is based



Tandem's remote data facility

  • J. Lyon
  • Computer Science
    Digest of Papers Compcon Spring '90. Thirty-Fifth IEEE Computer Society International Conference on Intellectual Leverage
  • 1990
RDF allows an organization to maintain a geographically remote backup system with an up-to-date copy of the database so that, should the first system fail, the second system can rapidly take over the workload, minimizing downtime.

Why Do Computers Stop and What Can Be Done About It?

  • J. Gray
  • Computer Science
    Symposium on Reliability in Distributed Software and Database Systems
  • 1986
It is pointed out that faults in production software are often soft (transient) and that a transaction mechanism combined with persistent process-pairs provides fault-tolerant execution, the key to software fault tolerance.

The N-Version Approach to Fault-Tolerant Software

  • A. Avizienis
  • Computer Science
    IEEE Transactions on Software Engineering
  • 1985
Principal requirements for the implementation of N-version software are summarized, and the DEDIX distributed supervisor and testbed for the execution of N-version software are described.

Probabilistic Logic and the Synthesis of Reliable Organisms from Unreliable Components

The paper that follows is based on notes taken by Dr. R. S. Pierce on five lectures given by the author at the California Institute of Technology in January 1952, and it is the author's conviction that error should be treated by thermodynamic methods, and be the subject of a thermodynamical theory.

Software Fault Tolerance

The principal models, specification, building, evaluation, and system integration of fault-tolerant software are discussed, along with goals for future work.

Learning from field experience with fault tolerant systems

  • Proc. Int'l Workshop Hardware Fault Tolerance in Multiprocessors (at University of Illinois
  • 1989

Dissecting software failures

  • Hewlett-Packard Journal
  • 1989

If the UPS is present but fails, then the UPS failure is the fatal fault


Powering computer-controlled systems: AC or DC?

  • Telesis
  • 1984
