OS-level hang detection in complex software systems

  • Authors:
  • Antonio Bovenzi;Marcello Cinque;Domenico Cotroneo;Roberto Natella;Gabriella Carrozza

  • Affiliations:
  • Dipartimento di Informatica e Sistemistica, Universita degli Studi di Napoli Federico II, Via Claudio 21, 80125, Naples, Italy.;Dipartimento di Informatica e Sistemistica, Universita degli Studi di Napoli Federico II, Via Claudio 21, 80125, Naples, Italy.;Dipartimento di Informatica e Sistemistica, Universita degli Studi di Napoli Federico II, Via Claudio 21, 80125, Naples, Italy.;Dipartimento di Informatica e Sistemistica, Universita degli Studi di Napoli Federico II, Via Claudio 21, 80125, Naples, Italy.;SESM SCARL, Via Circumvallazione Esterna di Napoli, 80014, Giugliano in Campania, Naples, Italy

  • Venue:
  • International Journal of Critical Computer-Based Systems
  • Year:
  • 2011

Abstract

Many critical services are nowadays provided by large and complex software systems. However, this increasing complexity introduces several sources of non-determinism, which may lead to hang failures: the system appears to be running, but some of its services are perceived as unresponsive. Online monitoring is the only way to detect and promptly react to such failures. However, in systems based on off-the-shelf components, online detection can be difficult, since instrumentation and log data collection may not be feasible in practice. This paper proposes a detection framework to cope with software hangs. The framework enables non-intrusive monitoring of complex systems, based on multiple sources of data gathered at the operating system (OS) level. The collected data are then combined to reveal hang failures. The framework is evaluated through a fault injection campaign on two complex systems from the air traffic management (ATM) domain. Results show that combining several monitors at the OS level is effective at detecting hang failures, in terms of coverage and false positives, with a negligible impact on performance.
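
The abstract does not say how the data from the individual OS-level monitors are combined. As a purely illustrative sketch, the Python snippet below assumes a simple weighted vote: each monitor raises a boolean alarm, and a hang is reported when the weighted share of alarming monitors reaches a threshold. The monitor names, the weights, and the threshold are hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass
from typing import Dict, Iterable


@dataclass(frozen=True)
class Monitor:
    """A hypothetical OS-level monitor with an assumed trust weight."""
    name: str
    weight: float


def hang_score(monitors: Iterable[Monitor], alarms: Dict[str, bool]) -> float:
    """Return the weighted fraction of monitors currently raising an alarm."""
    monitors = list(monitors)
    total = sum(m.weight for m in monitors)
    if total <= 0:
        raise ValueError("monitor weights must sum to a positive value")
    # Monitors missing from `alarms` are treated as not alarming.
    raised = sum(m.weight for m in monitors if alarms.get(m.name, False))
    return raised / total


def is_hang(monitors: Iterable[Monitor], alarms: Dict[str, bool],
            threshold: float = 0.5) -> bool:
    """Report a hang when the combined score reaches the assumed threshold."""
    return hang_score(monitors, alarms) >= threshold


# Hypothetical monitors; names and weights are placeholders for illustration.
MONITORS = [
    Monitor("lock_wait_time", 3.0),
    Monitor("syscall_errors", 2.0),
    Monitor("scheduling_timeout", 1.0),
]
```

With these placeholder weights, a single alarm from `scheduling_timeout` alone stays below the threshold, while simultaneous alarms from `lock_wait_time` and `syscall_errors` exceed it. Requiring agreement between monitors in this way is one plausible means of keeping false positives low, which is the property the abstract highlights for the combined approach.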