Faults caused by resource configuration account for the majority of errors in distributed software systems, yet the problem of identifying a faulty configuration remains largely unsolved. Current approaches to fault identification rely on event correlation techniques, which suffer from the limited granularity of the data generated by software components. As the complexity of cloud environments increases, resource sharing increases manyfold, making it even harder to isolate configuration faults through the analysis of events. In this paper, we propose a scalable approach that not only detects the presence of a configuration fault but also attempts to pinpoint the parameter that is the source of the observed fault. We leverage knowledge of the shared resources in the environment and use a simple matrix representation to provide near real-time analysis of faults. This allows the solution to be used both for reactive management and for automated proactive problem determination. Simulation experiments demonstrate that our approach identifies configuration faults in less time and with greater accuracy. Our algorithm handles the complexity of the problem gracefully as the system size grows.
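To make the idea of a matrix representation of shared resources concrete, the following is a minimal illustrative sketch and not the algorithm evaluated in the paper. It assumes a 0/1 matrix with one row per component and one column per configuration parameter, and ranks parameters by the Jaccard overlap between the components that depend on a parameter and the components currently observed to fail. The component names, parameter names, and the Jaccard scoring rule are all hypothetical choices made for illustration.

```python
# Hypothetical data: which components read which shared configuration parameter.
USES = {
    "max_connections": {"web", "app", "db"},
    "heap_size": {"app", "cache"},
    "dns_server": {"web", "app"},
}
COMPONENTS = ["web", "app", "db", "cache"]


def as_matrix(uses, components):
    """Build a 0/1 matrix: one row per component, one column per parameter."""
    params = sorted(uses)
    rows = [[1 if c in uses[p] else 0 for p in params] for c in components]
    return params, rows


def rank_parameters(params, matrix, failing):
    """Rank parameters by Jaccard overlap between dependents and failing rows.

    This scoring rule is an assumption for illustration only.
    """
    scores = []
    for j, p in enumerate(params):
        both = sum(1 for i, row in enumerate(matrix) if row[j] and failing[i])
        either = sum(1 for i, row in enumerate(matrix) if row[j] or failing[i])
        scores.append((p, both / either if either else 0.0))
    # Highest score first; sorted() is stable, so ties keep column order.
    return sorted(scores, key=lambda s: -s[1])


# Example observation: web, app, and db are failing; cache is healthy.
params, matrix = as_matrix(USES, COMPONENTS)
failing = [1, 1, 1, 0]
ranked = rank_parameters(params, matrix, failing)
```

In this toy case the parameter whose dependents coincide exactly with the failing components ranks first. Because the score is computed column by column, adding a parameter adds one column and leaves the other scores unchanged.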