Automatic Recognition of Intermittent Failures: An Experimental Study of Field Data
IEEE Transactions on Computers
On the Quality of Service of Failure Detectors
IEEE Transactions on Computers
Automated support for classifying software failure reports
Proceedings of the 25th International Conference on Software Engineering
Measurement of Failure Rate in Widely Distributed Software
FTCS '95 Proceedings of the Twenty-Fifth International Symposium on Fault-Tolerant Computing
Basic Concepts and Taxonomy of Dependable and Secure Computing
IEEE Transactions on Dependable and Secure Computing
Data Mining Approaches to Software Fault Diagnosis
RIDE '05 Proceedings of the 15th International Workshop on Research Issues in Data Engineering: Stream Data Mining and Applications
Construction of a Highly Dependable Operating System
EDCC '06 Proceedings of the Sixth European Dependable Computing Conference
Emulation of Software Faults: A Field Data Study and a Practical Approach
IEEE Transactions on Software Engineering
I/O system performance debugging using model-driven anomaly characterization
FAST'05 Proceedings of the 4th conference on USENIX Conference on File and Storage Technologies - Volume 4
Checking system rules using system-specific, programmer-written compiler extensions
OSDI'00 Proceedings of the 4th conference on Symposium on Operating System Design & Implementation - Volume 4
Data mining approaches for intrusion detection
SSYM'98 Proceedings of the 7th conference on USENIX Security Symposium - Volume 7
Vigilant: out-of-band detection of failures in virtual machines
ACM SIGOPS Operating Systems Review
Hang analysis: fighting responsiveness bugs
Proceedings of the 3rd ACM SIGOPS/EuroSys European Conference on Computer Systems 2008
Exploring recovery from operating system lockups
ATC'07 2007 USENIX Annual Technical Conference on Proceedings of the USENIX Annual Technical Conference
Anomaly localization in large-scale clusters
CLUSTER '07 Proceedings of the 2007 IEEE International Conference on Cluster Computing
An empirical study of reported bugs in server software with implications for automated bug diagnosis
Proceedings of the 32nd ACM/IEEE International Conference on Software Engineering - Volume 1
Adaptive monitoring in microkernel OSs
DSNW '10 Proceedings of the 2010 International Conference on Dependable Systems and Networks Workshops (DSN-W)
A sense of self for Unix processes
SP'96 Proceedings of the 1996 IEEE conference on Security and privacy
Operating system support to detect application hangs
VECoS'08 Proceedings of the Second international conference on Verification and Evaluation of Computer and Communication Systems
Hi-index | 0.00 |
Many critical services are nowadays provided by large and complex software systems. However, the increasing complexity introduces several sources of non-determinism, which may lead to hang failures: the system appears to be running, but part of its services is perceived as unresponsive. Online monitoring is the only way to detect and to promptly react to such failures. However, when dealing with off-the-shelf-based systems, online detection can be tricky since instrumentation and log data collection may not be feasible in practice. In this paper, a detection framework to cope with software hangs is proposed. The framework enables the non-intrusive monitoring of complex systems, based on multiple sources of data gathered at the operating system (OS) level. Collected data are then combined to reveal hang failures. The framework is evaluated through a fault injection campaign on two complex systems from the air traffic management (ATM) domain. Results show that the combination of several monitors at the OS level is effective to detect hang failures in terms of coverage and false positives and with a negligible impact on performance.