Active Diagnosis of High-Level Faults in Distributed Internet Services
APNOMS '08 Proceedings of the 11th Asia-Pacific Symposium on Network Operations and Management: Challenges for Next Generation Network Operations and Service Management
A statistical approach to detect application-level failures in internet services
FSKD'09 Proceedings of the 6th international conference on Fuzzy systems and knowledge discovery - Volume 5
A Recovery-Oriented Approach for Software Fault Diagnosis in Complex Critical Systems
International Journal of Adaptive, Resilient and Autonomic Systems
Hi-index | 0.00 |
For dependability outages in distributed internet infrastructures, it is often not enough to detect a failure, but it is also required to diagnose it, i.e., to identify its source. Complex applications deployed in multi-tier environments make diagnosis challenging because of fast error propagation, black-box applications, high diagnosis delay, the amount of states that can be maintained, and imperfect diagnostic tests. Here, we propose a probabilistic diagnosis model for arbitrary failures in components of a distributed application. The monitoring system (the Monitor) passively observes the message exchanges between the components and, at runtime, performs a probabilistic diagnosis of the component that was the root cause of a failure. We demonstrate the approach by applying it to the Pet Store J2EE application, and we compare it with Pinpoint by quantifying latency and accuracy in both systems. The Monitor outperforms Pinpoint by achieving comparably accurate diagnosis with higher precision in shorter time.