Proactive detection of software aging mechanisms in performance critical computers

  • Kenny C. Gross, Vivek Bhardwaj, Randall L. Bickford
  • Published 5 December 2002
  • Computer Science
  • 27th Annual NASA Goddard/IEEE Software Engineering Workshop, 2002. Proceedings.
Software aging is a phenomenon, usually caused by resource contention, that can cause mission-critical and business-critical computer systems to hang, panic, or suffer performance degradation. If the incipience or onset of software aging mechanisms can be reliably detected in advance of performance degradation, corrective actions can be taken to prevent system hangs, or dynamic failover events can be triggered in fault-tolerant systems. In the 1990s the U.S. Dept. of Energy and NASA funded…

CHAOS: Accurate and Realtime Detection of Aging-Oriented Failure Using Entropy
A novel entropy-based aging indicator, Multidimensional Multi-scale Entropy (MMSE), which employs the complexity embedded in runtime performance metrics to indicate software aging and leverages multi-scale and multi-dimension integration to tolerate system fluctuations.
The primary method to fight aging is software rejuvenation, i.e., restarting the aging application either periodically or adaptively. The adaptive approach has obvious advantages over the periodic scheme but requires aging models that can predict the expected performance over at least part of the rejuvenation cycle.
Using machine learning for non-intrusive modeling and prediction of software aging
  • A. Andrzejak, L. Silva
  • Computer Science
    NOMS 2008 - 2008 IEEE Network Operations and Management Symposium
  • 2008
A method for monitoring and modeling performance degradation in SOA applications, particularly application servers, is presented, and several state-of-the-art classification methods are evaluated to make the measurement-based aging models more adaptive and more robust against transient failures.
Software error early detection system based on run-time statistical analysis of function return values
The experimental results indicate that the proposed statistical method can be effective in identifying problems early on, potentially allowing for defensive measures, and that the overhead is negligible at less than 1%.
Seer: A Lightweight Online Failure Prediction Approach
A lightweight online failure prediction approach, called Seer, to predict the manifestation of failures at runtime, i.e., while the system is running and before the failures occur, so that preventive and/or protective measures can be taken to improve software reliability.
The resiliency challenge presented by soft failure incidents
This paper proposes a new method for solutions deployed on IBM z/OS™ systems to respond when either the system or the application stops running, and uses machine learning and mathematical modeling to identify normal behavior, enabling the detection of abnormal behavior before it impacts the customer.
Seer: A Lightweight Online Failure Prediction Approach
It is conjectured that large cost reductions in collecting internal execution data for online failure prediction may derive from pushing substantial parts of the data collection work onto the hardware.
Automatic software interference detection in parallel applications
An automated software interference detection methodology for Single Program, Multiple Data (SPMD) parallel applications by measuring the relative timing of application events rather than system level events such as CPU utilization, which lets the system automatically accommodate natural variations in an application's utilization of resources.
SimSPRT-II: Monte Carlo Simulation of Sequential Probability Ratio Test Algorithms for Optimal Prognostic Performance
  • Tahereh Masoumi, K. Gross
  • Computer Science
    2016 International Conference on Computational Science and Computational Intelligence (CSCI)
  • 2016
SimSPRT-II is a comprehensive parametric Monte Carlo simulation framework for tuning, optimization, and performance evaluation of SPRT-based AI algorithms for a broad range of engineering and security prognostic applications.
Towards Assessing Representativeness of Fault Injection-Generated Failure Data for Online Failure Prediction
  • Ivano Irrera, M. Vieira
  • Computer Science
    2015 IEEE International Conference on Dependable Systems and Networks Workshops
  • 2015
This work presents a preliminary study towards the assessment of the representativeness of failure-related data by using the G-SWFIT realistic software fault injection technique, and addresses the definition of concepts and metrics for the representativeness estimation and assessment.


Modeling and analysis of software aging and rejuvenation
Stochastic models to evaluate the effectiveness of proactive fault management in operational software systems and to determine optimal times to perform rejuvenation for different scenarios are discussed.
Analysis and implementation of software rejuvenation in cluster systems
This paper discusses software rejuvenation as applied to cluster systems using Stochastic Reward Nets (SRNs), determines the optimal rejuvenation interval based on system availability and cost, and introduces a new rejuvenation policy based on prediction that can dramatically increase system availability and reduce downtime cost.
To assure the continued safe, reliable and efficient operation of a nuclear power plant, it is essential that accurate online measurement information is available to the plant operators, engineering
Online Signal Validation for Assured Data Quality
A new procedure to automatically validate and assess the quality of online signal data is presented and uses model based parameter estimation in conjunction with statistical fault detection and Bayesian fault diagnosis.
On-board preventive maintenance for long-life deep-space missions: a model-based analysis
  • A. Tai, L. Alkalai, S. Chau
  • Computer Science
    Proceedings. IEEE International Computer Performance and Dependability Symposium. IPDS'98 (Cat. No.98TB100248)
  • 1998
An approach to on-board preventive maintenance which rejuvenates a system via periodical duty switching between system components, slowing down a system's aging process and enhancing mission reliability is presented.
Estimating Software Rejuvenation Schedules in High-Assurance Systems
The classical result by Huang et al. (1995) is extended, and a modified stochastic model is proposed to generate the software rejuvenation schedule, which is formulated via the semi-Markov reward process and derived analytically in terms of the reward rate.
Application of a model-based fault detection system to nuclear plant signals
To assure the continued safe and reliable operation of a nuclear power station, it is essential that accurate online information on the current state of the entire system be available to the
Model-based nuclear power plant monitoring and fault detection: Theoretical foundations
The theoretical basis and validation studies of a real-time, model-based process monitoring and fault detection system are presented. Through use of a non-linear state estimation technique coupled
Software aging
  • D. Parnas
  • Computer Science
    Proceedings of 16th International Conference on Software Engineering
  • 1994
A sign that the software engineering profession has matured will be that researchers and practitioners lose their preoccupation with the first release and focus on the long-term health of the products.
Sequential probability ratio tests for reactor signal validation and sensor surveillance applications
The properties of sequential probability ratio tests (SPRT's) are investigated, and theoretical results are validated with tests that use DN-signal data taken from the EBR-II in Idaho.