Software systems employed in critical scenarios are increasingly large and complex. The use of many heterogeneous components creates complex interdependencies and introduces sources of non-determinism, which often lead to the activation of subtle faults. Because of their complex triggering patterns, such faults typically escape the testing phase. Effective on-line monitoring is the only way to detect them and to react promptly, so as to avoid more serious consequences. In this paper, we propose an error detection framework to cope with software failures, which combines multiple sources of data gathered at both the application level and the OS level. The framework is evaluated through a fault injection campaign on a complex system from the Air Traffic Management (ATM) domain. Results show that combining several monitors is effective at detecting errors, as measured by false alarms, precision, and recall.
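To make the evaluation metrics concrete, the following is a minimal sketch of how alarms from several monitors might be combined and scored against fault injection outcomes. The OR-combination rule, the function names, and the sample data are all illustrative assumptions; the abstract does not specify how the framework actually combines its monitors, and the numbers below are not results from the paper.

```python
from dataclasses import dataclass


@dataclass
class Scores:
    precision: float
    recall: float
    false_alarm_rate: float


def combine_any(*monitors):
    """Hypothetical combination rule: raise an alarm if any monitor does."""
    return [any(flags) for flags in zip(*monitors)]


def score(alarms, faulty):
    """Score per-run alarm flags against ground truth (fault injected or not)."""
    pairs = list(zip(alarms, faulty))
    tp = sum(a and f for a, f in pairs)              # alarm on a faulty run
    fp = sum(a and not f for a, f in pairs)          # alarm on a fault-free run
    fn = sum((not a) and f for a, f in pairs)        # missed faulty run
    tn = sum((not a) and (not f) for a, f in pairs)  # quiet on a fault-free run
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    false_alarm_rate = fp / (fp + tn) if fp + tn else 0.0
    return Scores(precision, recall, false_alarm_rate)


# Made-up example: six runs, the first three with an injected fault.
faulty = [True, True, True, False, False, False]
app_level = [True, False, False, False, True, False]   # assumed application-level monitor
os_level = [False, True, False, False, False, False]   # assumed OS-level monitor

combined = combine_any(app_level, os_level)
print(score(app_level, faulty))
print(score(combined, faulty))
```

In this toy data the combined detector catches one more faulty run than the application-level monitor alone without adding false alarms, which is the kind of effect the evaluation measures. An OR rule can just as easily raise the false alarm rate, so the choice of combination rule is a design decision in its own right.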