A methodology for detection and estimation of software aging

@article{Garg1998AMF,
  title={A methodology for detection and estimation of software aging},
  author={Sachin Garg and Aad van Moorsel and Kalyan Vaidyanathan and Kishor S. Trivedi},
  journal={Proceedings Ninth International Symposium on Software Reliability Engineering (Cat. No.98TB100257)},
  year={1998},
  pages={283-292}
}
  • S. Garg, A. Moorsel, Kishor S. Trivedi
  • Published 4 November 1998
  • Computer Science
  • Proceedings Ninth International Symposium on Software Reliability Engineering (Cat. No.98TB100257)
The phenomenon of software aging refers to the accumulation of errors during the execution of the software, which eventually results in its crash/hang failure. A gradual performance degradation may also accompany software aging. Proactive fault management techniques such as "software rejuvenation" (Y. Huang et al., 1995) may be used to counteract aging if it exists. We propose a methodology for detection and estimation of aging in the UNIX operating system. First, we present the design and…
An approach for estimation of software aging in a Web server
TLDR
A methodology based on time series analysis to detect and estimate resource exhaustion times due to software aging in a Web server while subjecting it to an artificial workload is proposed.
An Advanced Methodology for Measuring and Characterizing Software Aging
TLDR
A new metric, AS, based on the nonlinear trend estimated by the Hodrick-Prescott filter, is proposed to dynamically measure the severity of aging and is validated on real aging time series collected from a VOD (video-on-demand) server.
Modeling and analysis of software aging and rejuvenation
TLDR
Stochastic models are discussed that evaluate the effectiveness of proactive fault management in operational software systems and determine optimal times to perform rejuvenation for different scenarios.
Analysis of Software Aging in a Web Server
TLDR
Based on the models employed here, proactive management techniques like software rejuvenation, triggered by actual measurements, can be built, and it is shown how exploiting the seasonal variation can help in adequately predicting future resource usage.
Software Aging Analysis of the Linux Operating System
TLDR
A software aging analysis at the operating system level is presented, investigating software aging sources inside the Linux kernel, confirming the presence of aging sources in Linux, and relating the observed aging dynamics to the behaviour of the monitored subsystems.
On the effectiveness of Mann-Kendall test for detection of software aging
TLDR
It is shown that the Mann-Kendall test is highly vulnerable to creating false positives in the context of aging detection, and that the amount of data considered in the test can be reduced; however, the time to detect aging then increases considerably.
for Software Rejuvenation
TLDR
This paper describes how to include faults attributed to software aging in the framework of Gray's software fault classification (deterministic and transient), and builds a semi-Markov reward model based on workload and resource usage data collected from the UNIX operating system.
Using Accelerated Life Tests to Estimate Time to Software Aging Failure
TLDR
This paper proposes and evaluates the use of quantitative accelerated life tests (QALT) to reduce the time to obtain the lifetime distribution of systems that fail due to software aging, and reduces the time required to obtain the failure times by a factor of seven.
A comprehensive model for software rejuvenation
TLDR
This paper describes how to include faults attributed to software aging in the framework of Gray's software fault classification (deterministic and transient), and builds a semi-Markov reward model based on workload and resource usage data collected from the UNIX operating system.

References

Automatic Recognition of Intermittent Failures: An Experimental Study of Field Data
TLDR
Comparisons to real failure/repair information obtained from field engineers show that, in about 85% of the cases, the error symptoms recognized by this approach correspond to real problems.
Effect of System Workload on Operating System Reliability: A Study on IBM 3081
TLDR
An analysis of operating system failures on an IBM 3081 running VM/SP finds three broad categories of software failures: error handling, program control or logic, and hardware related; it is found that more than 25 percent of software failures occur in the hardware/software interface.
A case study of Ethernet anomalies in a distributed computing environment
TLDR
In a preliminary effort to understand and catalog how networks behave under various conditions, two cases of anomalous behavior are analyzed in detail.
Identifying software problems using symptoms
  • Inhwan Lee, R. Iyer, A. Metha
  • Computer Science
    Proceedings of IEEE 24th International Symposium on Fault-Tolerant Computing
  • 1994
TLDR
Comparisons using the failure, diagnosis, and repair logs in two Tandem system software products show that between 75% and 95% of recurrences can be identified successfully by matching stack traces and symptom strings, indicating that automatic identification of recurrences based on their symptoms is possible.
Software defects and their impact on system availability-a study of field failures in operating systems
  • M. Sullivan, R. Chillarege
  • Business
    [1991] Digest of Papers. Fault-Tolerant Computing: The Twenty-First International Symposium
  • 1991
TLDR
It is shown that the impact of an overlay defect is, on average, much higher than that of a regular defect, that boundary conditions and allocation management are the major causes of overlay defects, not timing, and that most overlays are small and corrupt data near the data that the programmer meant to update.
Dependability Measurement and Modeling of a Multicomputer System
TLDR
A measurement-based analysis of error data collected from a DEC VAXcluster multicomputer system is presented and shows that errors are highly correlated across machines and across time.
High-availability computer systems
TLDR
The techniques used to build highly available computer systems are sketched, and the use of pairs of computer systems at separate locations to guard against unscheduled outages due to outside sources (communication or power failures, earthquakes, etc.) is addressed.
Estimates of the Regression Coefficient Based on Kendall's Tau
Abstract The least squares estimator of a regression coefficient β is vulnerable to gross errors and the associated confidence interval is, in addition, sensitive to non-normality of the parent
Tcl Extensions for Network Management Applications
TLDR
Extensions to the Tool Command Language (Tcl) are presented that are designed to implement smart network management agents that can receive and execute management scripts provided by other management stations or agents.
Time series analysis, forecasting and control