PeerWatch: a fault detection and diagnosis tool for virtualized consolidation systems

  • Authors:
  • Hui Kang;Haifeng Chen;Guofei Jiang

  • Affiliations:
  • SUNY Stony Brook University, Stony Brook, NY, USA;NEC Laboratories America, Princeton, NJ, USA;NEC Laboratories America, Princeton, NJ, USA

  • Venue:
  • Proceedings of the 7th international conference on Autonomic computing
  • Year:
  • 2010

Abstract

Server virtualization has become an effective means of consolidating numerous applications onto a small number of machines. While this strategy can yield significant savings in power and hardware cost, it can complicate fault management because of the increased scale and complexity of the virtualized environment. In this paper, we propose PeerWatch, a fault detection and diagnosis tool designed specifically for virtualized consolidation systems. Based on the observation that each application usually runs as multiple instances in a virtualized data center, PeerWatch uses a statistical technique, canonical correlation analysis (CCA), to extract the correlated characteristics between multiple application instances. The extracted correlations are then used to monitor the status of each application instance. If some correlations drop significantly during operation, PeerWatch considers the system to be in a faulty state and raises alarms. Compared with traditional fault detection techniques, PeerWatch is robust to system dynamics and can therefore avoid many false alarms. Once a fault has been detected, PeerWatch runs a diagnosis process that also takes advantage of the multiple instances in the virtualized system. The diagnosis combines spatial and temporal analysis of the measurement data across multiple instances before and after the failure. As a result, PeerWatch can obtain much more accurate clues about the root cause of the fault. Experimental results on our virtualized testbed demonstrate the effectiveness of the proposed detection and diagnosis tool.
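
The detection idea in the abstract (correlated behavior between peer instances, with an alarm when the correlation drops) can be illustrated with a minimal sketch. This is not the PeerWatch implementation: the abstract gives no algorithmic details, so the function names, the synthetic metrics, and the `drop_threshold` value below are all assumptions made for illustration. The canonical correlations are computed with a standard QR-plus-SVD approach in NumPy.

```python
import numpy as np


def canonical_correlations(x, y):
    """Canonical correlations between two metric matrices.

    Rows are time samples, columns are metrics of one application instance.
    Assumes both matrices have full column rank after centering.
    """
    x = x - x.mean(axis=0)
    y = y - y.mean(axis=0)
    # Orthonormal bases for the column spaces of the centered data.
    qx, _ = np.linalg.qr(x)
    qy, _ = np.linalg.qr(y)
    # Singular values of Qx^T Qy are the canonical correlations.
    s = np.linalg.svd(qx.T @ qy, compute_uv=False)
    return np.clip(s, 0.0, 1.0)


def correlation_dropped(x, y, baseline, drop_threshold=0.3):
    """Flag a window whose top canonical correlation fell well below baseline.

    drop_threshold is a hypothetical tuning parameter, not a value from the paper.
    """
    current = canonical_correlations(x, y)[0]
    return bool((baseline - current) > drop_threshold)


def make_window(rng, n=200, coupled=True):
    """Synthetic metrics for two instances driven by one shared workload."""
    workload = rng.normal(size=(n, 1))
    a = workload @ np.array([[1.0, 0.5, 0.2]]) + 0.05 * rng.normal(size=(n, 3))
    if coupled:
        b = workload @ np.array([[0.8, 0.3, 0.6]]) + 0.05 * rng.normal(size=(n, 3))
    else:
        # Simulated fault: instance B no longer follows the shared workload.
        b = rng.normal(size=(n, 3))
    return a, b


rng = np.random.default_rng(0)

# Learn a baseline correlation from a healthy training window.
train_a, train_b = make_window(rng)
baseline = canonical_correlations(train_a, train_b)[0]

# Evaluate a healthy window and a faulty window against that baseline.
ok_a, ok_b = make_window(rng)
bad_a, bad_b = make_window(rng, coupled=False)
healthy_alarm = correlation_dropped(ok_a, ok_b, baseline)
faulty_alarm = correlation_dropped(bad_a, bad_b, baseline)
```

In this sketch the healthy window tracks the shared workload, so its correlation stays near the baseline even though the workload itself changes, which is the intuition behind robustness to system dynamics. Only the decoupled window should trigger an alarm.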