Proactive Detection of Software Aging Mechanisms in Performance Critical Computers

  • Authors:
  • Kenny C. Gross;Vatsal Bhardwaj;Randy Bickford

  • Venue:
  • SEW '02 Proceedings of the 27th Annual NASA Goddard Software Engineering Workshop (SEW-27'02)
  • Year:
  • 2002

Abstract

Software aging is a phenomenon, usually caused by resource contention, that can cause mission-critical and business-critical computer systems to hang, panic, or suffer performance degradation. If the incipience or onset of software aging mechanisms can be reliably detected in advance of performance degradation, corrective actions can be taken to prevent system hangs, or dynamic failover events can be triggered in fault-tolerant systems. In the 1990s the U.S. Dept. of Energy and NASA funded development of an advanced statistical pattern recognition method called the Multivariate State Estimation Technique (MSET) for proactive online detection of dynamic sensor and signal anomalies in nuclear power plants and Space Shuttle Main Engine telemetry data. The present investigation was undertaken to assess the feasibility and practicability of applying MSET for real-time proactive detection of software aging mechanisms in complex, multi-CPU servers. The procedure uses MSET for model-based parameter estimation in conjunction with statistical fault detection and Bayesian fault decision processing. A real-time software telemetry harness was designed to continuously sample over 50 performance metrics related to computer system load, throughput, queue lengths, and transaction latencies. A series of fault injection experiments was conducted using a "memory leak" injector tool with controllable parasitic resource consumption rates. MSET was able to reliably detect the onset of resource contention problems with high sensitivity and excellent false-alarm avoidance. Spin-off applications of this NASA-funded innovation for business-critical eCommerce servers are described.
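
The abstract does not spell out the estimation or detection steps, so the following is only a minimal sketch of the general idea (estimate a healthy signal, then test the residuals sequentially) and not the authors' implementation. It is not MSET: a constant mean learned from healthy training data stands in for the model-based estimate, and Wald's sequential probability ratio test (SPRT) is assumed here as one possible form of "statistical fault detection". The simulated leak, the function names `sprt_alarm_index` and `simulate_used_memory`, and all numeric parameters are hypothetical.

```python
import math
import random
import statistics


def sprt_alarm_index(residuals, sigma, shift, alpha=0.001, beta=0.001):
    """Wald SPRT for a positive mean shift in Gaussian residuals.

    H0: residual mean is 0; H1: residual mean is `shift`; both with std `sigma`.
    Returns the index of the first sample at which H1 is accepted, or None.
    """
    upper = math.log((1.0 - beta) / alpha)   # crossing this accepts H1 (alarm)
    lower = math.log(beta / (1.0 - alpha))   # crossing this accepts H0 (restart test)
    llr = 0.0
    for i, r in enumerate(residuals):
        # Log-likelihood-ratio increment for N(shift, sigma^2) vs N(0, sigma^2).
        llr += (shift / sigma ** 2) * (r - shift / 2.0)
        if llr >= upper:
            return i
        if llr <= lower:
            llr = 0.0
    return None


def simulate_used_memory(n, baseline_mb, noise_mb, rng, leak_onset=None, leak_rate_mb=0.0):
    """Hypothetical telemetry: used memory with bounded noise and an optional linear leak."""
    samples = []
    for t in range(n):
        leaked = 0.0
        if leak_onset is not None:
            leaked = leak_rate_mb * max(0, t - leak_onset)
        samples.append(baseline_mb + rng.uniform(-noise_mb, noise_mb) + leaked)
    return samples


rng = random.Random(42)

# Training phase: learn the healthy level and spread from leak-free data.
training = simulate_used_memory(200, 1000.0, 1.0, rng)
estimate = statistics.mean(training)   # stand-in for a model-based estimate
sigma = statistics.pstdev(training)

# Monitoring phase: a leak of 0.5 MB per sample is injected at LEAK_ONSET.
LEAK_ONSET = 300
monitored = simulate_used_memory(500, 1000.0, 1.0, rng,
                                 leak_onset=LEAK_ONSET, leak_rate_mb=0.5)
residuals = [x - estimate for x in monitored]

alarm = sprt_alarm_index(residuals, sigma, shift=6.0)
print(f"leak injected at sample {LEAK_ONSET}, alarm raised at sample {alarm}")
```

A sequential test of this kind accumulates evidence across samples rather than thresholding each sample on its own, which is one way to trade detection delay against false-alarm rate through `alpha` and `beta`. A multivariate estimator would replace the constant `estimate` with a prediction that depends on the other telemetry signals.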