Diagnosing the cause of a system failure in data centers that house large, interconnected, complex computer systems is a Herculean task. The difficulty arises because separate monitoring tools for network, storage, servers, facilities, and applications each report on the health of only one part of the data center: the communication systems, the storage arrays, the physical machines, the environmental factors, and the applications, respectively. This information is useful but piecemeal, and the existing tools fail to provide a comprehensive view of the complete set of operations within a data center. In the absence of integrated monitoring and management tools, a data center administrator must, whenever a fault occurs, manually sift through and analyze the logs generated by the disparate monitoring tools to identify the root cause. In this paper we propose an integrated data center health monitoring and management framework built on top of the existing monitoring tools. The integrated framework leverages complex event processing techniques to process massive streams of events from these tools in (near) real time, and it enables automatic reuse of the existing monitoring tools in a non-intrusive manner.
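To make the idea of correlating event streams from separate monitoring tools more concrete, the following is a minimal sketch of one common complex event processing pattern: a sliding time window that flags a host when events from several different tools arrive close together. It is not the framework proposed in the paper. The `Event` fields, the `CorrelationWindow` class, the 30-second window, and the two-source threshold are all hypothetical choices made for illustration, and the sketch assumes events arrive in timestamp order.

```python
from collections import deque
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    """Hypothetical normalized event emitted by an existing monitoring tool."""
    timestamp: float  # seconds since some reference point
    source: str       # tool category, e.g. "network", "storage", "server"
    host: str         # affected machine or device
    message: str


class CorrelationWindow:
    """Sliding-window rule: fire when one host is reported by several tools.

    Assumption: events are pushed in non-decreasing timestamp order.
    """

    def __init__(self, window_seconds=30.0, min_sources=2):
        self.window_seconds = window_seconds
        self.min_sources = min_sources
        self.events = deque()

    def push(self, event):
        """Add an event; return the correlated events if the rule fires, else None."""
        self.events.append(event)

        # Evict events that have fallen out of the time window.
        cutoff = event.timestamp - self.window_seconds
        while self.events and self.events[0].timestamp < cutoff:
            self.events.popleft()

        # Collect everything in the window that concerns the same host.
        related = [e for e in self.events if e.host == event.host]
        sources = {e.source for e in related}
        if len(sources) >= self.min_sources:
            return related
        return None


window = CorrelationWindow(window_seconds=30.0, min_sources=2)
first = window.push(Event(0.0, "network", "h1", "link down"))
second = window.push(Event(10.0, "storage", "h2", "high latency"))
third = window.push(Event(20.0, "storage", "h1", "I/O timeout"))
late = window.push(Event(100.0, "server", "h1", "high CPU load"))
```

In this sketch only the third event triggers the rule, because it is the first moment at which two different tools have reported on host `h1` within the window. The monitoring tools themselves are left untouched; the rule only consumes the events they already emit, which is one way to read the non-intrusive reuse described above.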