Operating system support to detect application hangs

Authors:
G. Carrozza; M. Cinque; D. Cotroneo; R. Natella
Affiliations:
Dipartimento di Informatica e Sistemistica, Università degli Studi di Napoli Federico II, Naples, Italy; Dipartimento di Informatica e Sistemistica, Università degli Studi di Napoli Federico II, Naples, Italy; Dipartimento di Informatica e Sistemistica, Università degli Studi di Napoli Federico II, Naples, Italy; Laboratorio CINI ITEM, Complesso Univ. M. S. Angelo, Naples, Italy
Venue:
VECoS'08 Proceedings of the Second international conference on Verification and Evaluation of Computer and Communication Systems
Year:
2008

Abstract

On-line failure detection is an essential means of controlling and assessing the dependability of complex and critical software systems. In this context, effective detection strategies are required to minimize the possibility of catastrophic consequences. This objective is, however, difficult to achieve in complex systems, especially because of the many sources of non-determinism (e.g., multi-threading and distributed interaction) that may lead to software hangs, i.e., situations in which the system is still active but no longer capable of delivering its services. The paper proposes an approach to detect application hangs. It exploits multiple sources of indirect data gathered at the operating system level to monitor the system and to trigger alarms when the observed behavior deviates from the expected one. Fault injection experiments conducted on a research prototype show that combining several operating system monitors achieves high detection quality at an acceptable overhead.
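The abstract does not say which operating system data the monitors observe or how their outputs are combined. As a minimal sketch under stated assumptions, the Python below learns a normal range for each metric from fault-free runs and raises an alarm only when at least `quorum` of the monitors agree. The metric names (`syscalls_per_s`, `ctx_switches_per_s`, `ms_since_last_io`), the `margin` and `quorum` parameters, and the range-plus-voting rule are hypothetical illustrations, not the detector described in the paper.

```python
from dataclasses import dataclass
from typing import Dict, List, Sequence


@dataclass
class Monitor:
    """Hypothetical OS-level monitor: one metric plus the range of values
    considered normal (learned from fault-free runs)."""
    name: str
    low: float
    high: float

    def triggers(self, value: float) -> bool:
        # Any value outside the expected range counts as an anomaly.
        return value < self.low or value > self.high


def learn_monitor(name: str, observations: Sequence[float], margin: float = 0.1) -> Monitor:
    """Build a monitor from fault-free observations, widening the observed
    [min, max] range by `margin` times its width to tolerate normal jitter."""
    lo, hi = min(observations), max(observations)
    slack = margin * (hi - lo)
    return Monitor(name, lo - slack, hi + slack)


def hang_suspected(monitors: List[Monitor], sample: Dict[str, float], quorum: int) -> bool:
    """k-out-of-n vote (an assumed combination rule): report a suspected hang
    only when at least `quorum` monitors see an anomaly in the same sample."""
    fired = sum(1 for m in monitors if m.triggers(sample[m.name]))
    return fired >= quorum


if __name__ == "__main__":
    # Hypothetical per-window metrics collected during fault-free runs.
    training = {
        "syscalls_per_s": [900, 1000, 1100],
        "ctx_switches_per_s": [200, 250, 300],
        "ms_since_last_io": [5, 10, 20],
    }
    monitors = [learn_monitor(n, obs) for n, obs in training.items()]

    normal = {"syscalls_per_s": 1000, "ctx_switches_per_s": 250, "ms_since_last_io": 10}
    noisy = {"syscalls_per_s": 1000, "ctx_switches_per_s": 400, "ms_since_last_io": 10}
    hung = {"syscalls_per_s": 0, "ctx_switches_per_s": 5, "ms_since_last_io": 5000}

    for label, s in [("normal", normal), ("noisy", noisy), ("hung", hung)]:
        print(label, hang_suspected(monitors, s, quorum=2))
```

Run as a script, this prints `normal False`, `noisy False` and `hung True`. The "noisy" sample trips only the context-switch monitor, so a single-monitor detector would raise a false alarm there, while the two-of-three vote does not; the "hung" sample trips all three. This illustrates, under the assumptions above, why combining several indirect indicators can improve detection quality.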