Proactive detection of software aging mechanisms in performance critical computers
@article{Gross2002ProactiveDO, title={Proactive detection of software aging mechanisms in performance critical computers}, author={Kenny C. Gross and Vivek Bhardwaj and Randall L. Bickford}, journal={27th Annual NASA Goddard/IEEE Software Engineering Workshop, 2002. Proceedings.}, year={2002}, pages={17-23} }
Software aging is a phenomenon, usually caused by resource contention, that can cause mission critical and business critical computer systems to hang, panic, or suffer performance degradation. If the incipience or onset of software aging mechanisms can be reliably detected in advance of performance degradation, corrective actions can be taken to prevent system hangs, or dynamic failover events can be triggered in fault tolerant systems. In the 1990s the U.S. Dept. of Energy and NASA funded…
48 Citations
CHAOS: Accurate and Realtime Detection of Aging-Oriented Failure Using Entropy
- Computer Science, ArXiv
- 2015
A novel entropy-based aging indicator, Multidimensional Multi-scale Entropy (MMSE), which employs the complexity embedded in runtime performance metrics to indicate software aging and leverages multi-scale and multi-dimension integration to tolerate system fluctuations.
ROBUST AND ADAPTIVE MODELING OF SOFTWARE AGING
- Computer Science
- 2007
The primary method to fight aging is software rejuvenation, i.e., restarting the aging application periodically or adaptively. Adaptive rejuvenation has obvious advantages over a periodic schedule but requires aging models that can predict the expected performance over at least part of the rejuvenation cycle.
Using machine learning for non-intrusive modeling and prediction of software aging
- Computer Science, NOMS 2008 - 2008 IEEE Network Operations and Management Symposium
- 2008
A method for monitoring and modeling performance degradation in SOA applications, particularly application servers, is presented, and several state-of-the-art classification methods are evaluated to make the measurement-based aging models more adaptive and more robust against transient failures.
Software error early detection system based on run-time statistical analysis of function return values
- Computer Science
- 2006
The experimental results indicate that the proposed statistical method can be effective in identifying problems early on, potentially allowing for defensive measures, and that the overhead is negligible at less than 1%.
Seer: A Lightweight Online Failure Prediction Approach
- Computer Science, 2017 IEEE 41st Annual Computer Software and Applications Conference (COMPSAC)
- 2017
A lightweight online failure prediction approach, called Seer, to predict the manifestation of failures at runtime, i.e., while the system is running and before the failures occur, so that preventive and/or protective measures can be taken to improve software reliability.
The resiliency challenge presented by soft failure incidents
- Computer Science, IBM Syst. J.
- 2008
This paper proposes a new method for solutions deployed on IBM z/OS™ systems to respond when either the system or the application stops running, and uses machine learning and mathematical modeling to identify normal behavior, enabling the detection of abnormal behavior before it impacts the customer.
Seer: A Lightweight Online Failure Prediction Approach
- Computer Science, IEEE Transactions on Software Engineering
- 2016
It is conjectured that large cost reductions in collecting internal execution data for online failure prediction may derive from pushing substantial parts of the data collection work onto the hardware.
Automatic software interference detection in parallel applications
- Computer Science, Proceedings of the 2007 ACM/IEEE Conference on Supercomputing (SC '07)
- 2007
An automated software interference detection methodology for Single Program, Multiple Data (SPMD) parallel applications by measuring the relative timing of application events rather than system level events such as CPU utilization, which lets the system automatically accommodate natural variations in an application's utilization of resources.
SimSPRT-II: Monte Carlo Simulation of Sequential Probability Ratio Test Algorithms for Optimal Prognostic Performance
- Computer Science, 2016 International Conference on Computational Science and Computational Intelligence (CSCI)
- 2016
SimSPRT-II is a comprehensive parametric Monte Carlo simulation framework for tuning, optimization, and performance evaluation of SPRT-based AI algorithms in a broad range of engineering and security prognostic applications.
Towards Assessing Representativeness of Fault Injection-Generated Failure Data for Online Failure Prediction
- Computer Science, 2015 IEEE International Conference on Dependable Systems and Networks Workshops
- 2015
This work presents a preliminary study towards assessing the representativeness of failure-related data generated with the G-SWFIT realistic software fault injection technique, and addresses the definition of concepts and metrics for estimating and assessing that representativeness.
References
Modeling and analysis of software aging and rejuvenation
- Computer Science, Proceedings 33rd Annual Simulation Symposium (SS 2000)
- 2000
Stochastic models are discussed that evaluate the effectiveness of proactive fault management in operational software systems and determine optimal times to perform rejuvenation for different scenarios.
Analysis and implementation of software rejuvenation in cluster systems
- Computer Science, SIGMETRICS '01
- 2001
This paper discusses software rejuvenation as applied to cluster systems using Stochastic Reward Nets (SRNs), determines the optimal rejuvenation interval based on system availability and cost, and introduces a new rejuvenation policy based on prediction that can dramatically increase system availability and reduce downtime cost.
DEVELOPMENT OF AN ONLINE PREDICTIVE MONITORING SYSTEM FOR POWER GENERATING PLANTS
- Engineering
- 2002
To assure the continued safe, reliable and efficient operation of a nuclear power plant, it is essential that accurate online measurement information is available to the plant operators, engineering…
Online Signal Validation for Assured Data Quality
- Engineering
- 2001
A new procedure to automatically validate and assess the quality of online signal data is presented and uses model based parameter estimation in conjunction with statistical fault detection and Bayesian fault diagnosis.
On-board preventive maintenance for long-life deep-space missions: a model-based analysis
- Computer Science, Proceedings. IEEE International Computer Performance and Dependability Symposium. IPDS'98 (Cat. No.98TB100248)
- 1998
An approach to on-board preventive maintenance is presented that rejuvenates a system via periodic duty switching between system components, slowing down the system's aging process and enhancing mission reliability.
Estimating Software Rejuvenation Schedules in High-Assurance Systems
- Computer Science, Comput. J.
- 2001
The classical result by Huang et al. (1995) is extended, and a modified stochastic model is proposed to generate the software rejuvenation schedule, which is formulated via the semi-Markov reward process and derived analytically in terms of the reward rate.
Application of a model-based fault detection system to nuclear plant signals
- Engineering
- 1997
To assure the continued safe and reliable operation of a nuclear power station, it is essential that accurate online information on the current state of the entire system be available to the…
Model-based nuclear power plant monitoring and fault detection: Theoretical foundations
- Engineering
- 1997
The theoretical basis and validation studies of a real-time, model-based process monitoring and fault detection system are presented. Through use of a non-linear state estimation technique coupled…
Software aging
- Computer Science, Proceedings of 16th International Conference on Software Engineering
- 1994
A sign that the software engineering profession has matured will be that researchers and practitioners lose their preoccupation with the first release and focus on the long-term health of the products.
Sequential probability ratio tests for reactor signal validation and sensor surveillance applications
- Engineering
- 1989
The properties of sequential probability ratio tests (SPRT's) are investigated, and theoretical results are validated with tests that use DN-signal data taken from the EBR-II in Idaho.