OS-level hang detection in complex software systems
International Journal of Critical Computer-Based Systems
SAFECOMP'11 Proceedings of the 30th international conference on Computer safety, reliability, and security
On-line failure detection is an essential means of controlling and assessing the dependability of complex and critical software systems. In such a context, effective detection strategies are required to minimize the possibility of catastrophic consequences. This objective is, however, difficult to achieve in complex systems, especially because of several sources of non-determinism (e.g., multi-threading and distributed interaction) that may lead to software hangs, i.e., states in which the system is active but no longer capable of delivering its services. The paper proposes a detection approach to uncover application hangs. It exploits multiple indirect data gathered at the operating system level to monitor the system and to trigger alarms if the observed behavior deviates from the expected one. Fault injection experiments conducted on a research prototype show that combining several operating system monitors leads to a high quality of detection at an acceptable overhead.
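The idea of combining several operating system monitors can be illustrated with a minimal sketch. It is not the paper's implementation: the monitor names, the expected ranges, and the simple voting rule (`min_votes`) are all assumptions chosen only to show how individual deviation checks might be merged into a single alarm.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Monitor:
    """One OS-level indicator with an expected range (hypothetical values)."""
    name: str
    low: float
    high: float

    def deviates(self, value: float) -> bool:
        # An observation outside [low, high] counts as a deviation.
        return not (self.low <= value <= self.high)


def detect_hang(monitors: List[Monitor],
                sample: Dict[str, float],
                min_votes: int = 2) -> bool:
    """Return True (alarm) when at least min_votes monitors deviate.

    The voting rule is an assumption made for illustration; monitors
    with no value in the sample are simply skipped.
    """
    votes = sum(
        1 for m in monitors
        if m.name in sample and m.deviates(sample[m.name])
    )
    return votes >= min_votes


# Hypothetical monitors and observations, for illustration only.
monitors = [
    Monitor("syscalls_per_sec", 50, 5000),
    Monitor("ctx_switches_per_sec", 10, 2000),
    Monitor("io_wait_ms", 0, 500),
]
healthy = {"syscalls_per_sec": 800, "ctx_switches_per_sec": 300, "io_wait_ms": 20}
hung = {"syscalls_per_sec": 0, "ctx_switches_per_sec": 1, "io_wait_ms": 20}

alarm_healthy = detect_hang(monitors, healthy)
alarm_hung = detect_hang(monitors, hung)
```

Requiring agreement among several monitors is one plausible way to trade false alarms against missed hangs; a single deviating indicator (for example with `min_votes=1`) would raise the alarm sooner but more often spuriously.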