Exploring event correlation for failure prediction in coalitions of clusters
Proceedings of the 2007 ACM/IEEE conference on Supercomputing
Quantifying event correlations for proactive failure management in networked computing systems
Journal of Parallel and Distributed Computing
ICTSS'11 Proceedings of the 23rd IFIP WG 6.1 international conference on Testing software and systems
Hi-index | 0.00 |
In this paper, we present the SNMP-FD service, a novel failure detection service entirely based on the Simple Network Management Protocol (SNMP). This approach promises better interoperability with external tools and failure information sources, including network equipment and cluster management tools. We first show how the SNMP standard can be used to build a failure detection service. We describe the already standardized interfaces that can be reused and introduce the interfaces that need to be added. SNMP is used extensively in the service: for messaging, process status description, configuration, services statistics and delivering failure detection information to applications. We then present our implementation and an evaluation of performance and quality of service.