Software aging is a phenomenon, usually caused by resource contention, that can cause mission-critical and business-critical computer systems to hang, panic, or suffer performance degradation. If the onset of software aging mechanisms can be reliably detected before performance degrades, corrective actions can be taken to prevent system hangs, or dynamic failover events can be triggered in fault-tolerant systems. In the 1990s the U.S. Department of Energy and NASA funded development of an advanced statistical pattern recognition method called the Multivariate State Estimation Technique (MSET) for proactive online detection of dynamic sensor and signal anomalies in nuclear power plants and Space Shuttle Main Engine telemetry data. The present study investigates the feasibility and practicability of applying MSET to real-time proactive detection of software aging mechanisms in complex, multi-CPU servers. The procedure uses MSET for model-based parameter estimation in conjunction with statistical fault detection and Bayesian fault decision processing. A real-time software telemetry harness was designed to continuously sample over 50 performance metrics related to computer system load, throughput, queue lengths, and transaction latencies. A series of fault injection experiments was conducted using a "memory leak" injector tool with controllable parasitic resource consumption rates. MSET reliably detected the onset of resource contention problems with high sensitivity and excellent false-alarm avoidance. Spin-off applications of this NASA-funded innovation for business-critical eCommerce servers are described.
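The estimate-then-test pipeline described above can be illustrated with a minimal Python sketch. It is not the authors' implementation. The abstract does not give the MSET operator or name the statistical test, so two stand-ins are assumed here: a Gaussian-kernel weighted average of stored healthy-state vectors plays the role of the state estimator, and Wald's sequential probability ratio test (SPRT) for a Gaussian mean shift plays the role of the fault detector. The function names, the `bandwidth` parameter, and the default error rates are all hypothetical.

```python
import math


def kernel_estimate(obs, memory, bandwidth=1.0):
    """Estimate the expected healthy state for one telemetry vector.

    Stand-in for the MSET estimator (an assumption, not MSET itself):
    a Gaussian-kernel weighted average of stored healthy-state vectors.
    """
    d2 = [sum((o - x) ** 2 for o, x in zip(obs, m)) for m in memory]
    weights = [math.exp(-d / (2.0 * bandwidth ** 2)) for d in d2]
    total = sum(weights)
    if total == 0.0:
        # All weights underflowed: fall back to the nearest stored vector.
        return list(memory[d2.index(min(d2))])
    return [
        sum(w * m[j] for w, m in zip(weights, memory)) / total
        for j in range(len(obs))
    ]


def sprt_mean_shift(residuals, sigma, shift, alpha=0.01, beta=0.01):
    """Wald SPRT on residuals (observed minus estimated values).

    H0: residual mean is 0; H1: residual mean is `shift`; both Gaussian
    with standard deviation `sigma`. Returns the index of the first
    sample at which H1 is accepted, or None if no alarm is raised.
    """
    upper = math.log((1.0 - beta) / alpha)  # accept H1 (raise alarm)
    lower = math.log(beta / (1.0 - alpha))  # accept H0 (restart test)
    llr = 0.0
    for i, r in enumerate(residuals):
        # Log-likelihood ratio increment for a Gaussian mean shift.
        llr += (shift / sigma ** 2) * (r - shift / 2.0)
        if llr >= upper:
            return i
        if llr <= lower:
            llr = 0.0  # healthy decision reached; start a fresh test
    return None
```

In use, each sampled metric vector would be passed to `kernel_estimate` together with vectors recorded while the system was known to be healthy. The difference between the observed and estimated value of a metric is the residual fed to `sprt_mean_shift`. A residual that drifts away from zero, as it would under an injected memory leak, drives the log-likelihood ratio across the upper threshold, while the `alpha` and `beta` settings bound the false-alarm and missed-alarm probabilities.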