OS-level hang detection in complex software systems

Authors:
Antonio Bovenzi;Marcello Cinque;Domenico Cotroneo;Roberto Natella;Gabriella Carrozza
Affiliations:
Dipartimento di Informatica e Sistemistica, Universita degli Studi di Napoli Federico II, Via Claudio 21, 80125, Naples, Italy.;Dipartimento di Informatica e Sistemistica, Universita degli Studi di Napoli Federico II, Via Claudio 21, 80125, Naples, Italy.;Dipartimento di Informatica e Sistemistica, Universita degli Studi di Napoli Federico II, Via Claudio 21, 80125, Naples, Italy.;Dipartimento di Informatica e Sistemistica, Universita degli Studi di Napoli Federico II, Via Claudio 21, 80125, Naples, Italy.;SESM SCARL, Via Circumvallazione Esterna di Napoli, 80014, Giugliano in Campania, Naples, Italy
Venue:
International Journal of Critical Computer-Based Systems
Year:
2011

Abstract

Many critical services are nowadays provided by large and complex software systems. However, their increasing complexity introduces several sources of non-determinism, which may lead to hang failures: the system appears to be running, but some of its services are perceived as unresponsive. Online monitoring is the only way to detect and promptly react to such failures. However, in systems built from off-the-shelf components, online detection can be difficult, since instrumentation and log data collection may not be feasible in practice. This paper proposes a detection framework to cope with software hangs. The framework enables non-intrusive monitoring of complex systems, based on multiple sources of data gathered at the operating system (OS) level. The collected data are then combined to reveal hang failures. The framework is evaluated through a fault injection campaign on two complex systems from the air traffic management (ATM) domain. Results show that combining several monitors at the OS level is effective at detecting hang failures, in terms of both coverage and false positives, with a negligible impact on performance.
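
The idea of combining several OS-level monitors can be illustrated with a minimal sketch. The abstract does not specify which metrics are monitored or how their outputs are combined, so everything below is an assumption made for illustration: the metric names (`syscalls_per_s`, `max_wait_s`, `io_bytes_per_s`), the per-monitor thresholds, and the weighted-vote combination rule are hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Hypothetical snapshot of OS-level observations for one monitored process.
Sample = Dict[str, float]


@dataclass
class Monitor:
    name: str
    is_anomalous: Callable[[Sample], bool]  # True when this monitor raises an alarm
    weight: float = 1.0


def detect_hang(monitors: List[Monitor], sample: Sample, threshold: float) -> bool:
    """Flag a hang when the weighted sum of raised alarms reaches the threshold.

    This weighted vote is an assumed combination rule, not the paper's.
    """
    score = sum(m.weight for m in monitors if m.is_anomalous(sample))
    return score >= threshold


# Illustrative monitors; metric names and thresholds are assumptions.
MONITORS = [
    Monitor("syscall_rate", lambda s: s["syscalls_per_s"] < 1.0),
    Monitor("wait_time", lambda s: s["max_wait_s"] > 5.0),
    Monitor("io_throughput", lambda s: s["io_bytes_per_s"] < 10.0),
]

healthy = {"syscalls_per_s": 850.0, "max_wait_s": 0.2, "io_bytes_per_s": 4096.0}
hung = {"syscalls_per_s": 0.0, "max_wait_s": 42.0, "io_bytes_per_s": 0.0}
# A briefly idle process trips only the syscall-rate monitor.
idle = {"syscalls_per_s": 0.5, "max_wait_s": 0.1, "io_bytes_per_s": 2048.0}

if __name__ == "__main__":
    for label, sample in [("healthy", healthy), ("hung", hung), ("idle", idle)]:
        print(label, detect_hang(MONITORS, sample, threshold=2.0))
```

With a threshold of 2.0, at least two monitors must agree before a hang is reported, so a single noisy monitor (as in the `idle` sample) does not raise an alarm on its own. This is one plausible way in which combining monitors could trade off coverage against false positives.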