Server virtualization has become an effective means of consolidating many applications onto a small number of machines. While this strategy can yield significant savings in power and hardware cost, it complicates fault management because of the increased scale and complexity of the virtualized environment. In this paper, we propose PeerWatch, a fault detection and diagnosis tool designed specifically for virtualized consolidation systems. Based on the observation that each application usually runs as multiple instances in a virtualized data center, PeerWatch uses a statistical technique, canonical correlation analysis (CCA), to extract the correlated characteristics across application instances. The extracted correlations are then used to monitor the status of each application instance. If some correlations drop significantly during operation, PeerWatch considers the system faulty and raises an alarm. Compared with traditional fault detection techniques, PeerWatch is robust to system dynamics and can therefore avoid many false alarms. Once a fault has been detected, PeerWatch runs a diagnosis process that also takes advantage of the multiple instances present in virtualized systems. The diagnosis combines spatial and temporal analysis of the measurement data across multiple instances before and after the failure. As a result, PeerWatch can obtain much more accurate clues about the root cause of the fault. Experimental results on our virtualized testbed demonstrate the effectiveness of the proposed detection and diagnosis tool.
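
The detection idea described above (correlations between peer instances that drop when one instance misbehaves) can be illustrated with a minimal sketch. This is not PeerWatch's implementation: the function names, the synthetic metrics, and the `drop_threshold` value are all assumptions made for illustration, and the abstract does not specify how the tool decides that a drop is significant. The canonical correlations are computed with the standard QR-plus-SVD formulation using NumPy.

```python
import numpy as np


def canonical_correlations(x, y):
    """Canonical correlations between two (samples x metrics) arrays.

    Assumes both arrays have full column rank and the same number of rows.
    """
    x = x - x.mean(axis=0)
    y = y - y.mean(axis=0)
    # Orthonormal bases for the column spaces of the centered data.
    qx, _ = np.linalg.qr(x)
    qy, _ = np.linalg.qr(y)
    # Singular values of qx^T qy are the cosines of the principal angles,
    # which equal the canonical correlations (sorted in descending order).
    s = np.linalg.svd(qx.T @ qy, compute_uv=False)
    return np.clip(s, 0.0, 1.0)


def correlation_dropped(baseline, current, drop_threshold=0.3):
    """Hypothetical alarm rule: flag a large drop in the top correlation."""
    return bool(baseline[0] - current[0] > drop_threshold)


# Synthetic data (an assumption): two instances driven by a shared workload.
rng = np.random.default_rng(0)
n = 500
load = rng.normal(size=(n, 1))


def instance_metrics(driver):
    # Two metrics per instance, each a noisy linear function of the driver.
    noise = 0.1 * rng.normal(size=(n, 2))
    return np.hstack([driver, 2.0 * driver]) + noise


instance_a = instance_metrics(load)
instance_b = instance_metrics(load)
# Faulty case: instance b's metrics no longer follow the shared workload.
faulty_b = rng.normal(size=(n, 2))

healthy_corr = canonical_correlations(instance_a, instance_b)
faulty_corr = canonical_correlations(instance_a, faulty_b)
alarm = correlation_dropped(healthy_corr, faulty_corr)
```

In this toy setup the top canonical correlation is close to 1 while both instances track the same workload and falls sharply once one instance decouples from it, so the hypothetical rule raises an alarm. A real deployment would need a principled threshold and a way to tell a faulty instance from a legitimate workload shift.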