Anomaly localization in large-scale clusters

Authors:
Ziming Zheng;Yawei Li;Zhiling Lan
Affiliations:
Department of Computer Science, Illinois Institute of Technology, Chicago, 60616, USA;Department of Computer Science, Illinois Institute of Technology, Chicago, 60616, USA;Department of Computer Science, Illinois Institute of Technology, Chicago, 60616, USA
Venue:
CLUSTER '07 Proceedings of the 2007 IEEE International Conference on Cluster Computing
Year:
2007

Citing 0
Cited 3

OS-level hang detection in complex software systems
International Journal of Critical Computer-Based Systems

Framework for enabling system understanding
Euro-Par'11 Proceedings of the 2011 international conference on Parallel Processing - Volume 2

Failure prediction for HPC systems and applications: Current situation and open issues
International Journal of High Performance Computing Applications

Abstract

A critical problem in managing large-scale clusters is identifying the location of problems in the system when unusual events occur. High performance computing (HPC) systems continue to grow in scale, and when one fails to function properly, health-related data are collected for troubleshooting. However, because massive quantities of information are gathered from a large number of components, the root causes of anomalies are often buried like needles in a haystack. In this paper, we present a localization method that automatically identifies the potential root causes (i.e., a subset of nodes) of a problem from the overwhelming amount of data collected system-wide. System managers can then focus on examining these potential locations, which significantly reduces the human effort required for anomaly localization. Our method consists of three interrelated steps: (1) feature collection, to assemble a feature space for the system; (2) feature extraction, to obtain the most significant features for efficient data analysis by applying the principal component analysis (PCA) algorithm; and (3) outlier detection, to quickly identify the nodes that are “far away” from the majority by using the cell-based detection algorithm. Preliminary studies are presented to demonstrate the potential of our method for localizing anomalies in a computing environment where the nodes perform comparable tasks.
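
The three steps can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not give the feature set, the number of principal components, or the outlier thresholds, so everything concrete below is an assumption. The synthetic feature matrix, the choice of two components, and the parameters `d` and `m` are hypothetical, as are the function names `pca_reduce` and `cell_based_outliers`. The outlier step follows the general idea of cell-based distance outlier detection in two dimensions: with cells of side d/(2*sqrt(2)), whole cells can be accepted or flagged from neighbor counts alone, and only the remaining points need pairwise distance checks.

```python
import numpy as np
from collections import defaultdict

def pca_reduce(features, n_components=2):
    """Project rows (one per node) onto the top principal components."""
    centered = features - features.mean(axis=0)
    # Rows of vt are principal directions, ordered by decreasing variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:n_components].T

def cell_based_outliers(points, d, m):
    """Indices of 2-D points with at most m points (self included) within distance d."""
    side = d / (2 * np.sqrt(2))  # same cell: <= d/2 apart; adjacent cells: <= d apart
    cells = defaultdict(list)
    for idx, p in enumerate(points):
        cells[tuple(int(v) for v in np.floor(p / side))].append(idx)

    def members(cx, cy, lo, hi):
        # Points in cells whose cell distance (max of |dx|, |dy|) lies in [lo, hi].
        out = []
        for dx in range(-hi, hi + 1):
            for dy in range(-hi, hi + 1):
                if lo <= max(abs(dx), abs(dy)) <= hi:
                    out.extend(cells.get((cx + dx, cy + dy), []))
        return out

    outliers = []
    for (cx, cy), own in cells.items():
        # Everything in this cell and its 8 adjacent cells is within d.
        near = len(own) + len(members(cx, cy, 1, 1))
        if near > m:
            continue  # the whole cell is safe
        # Cells 2 or 3 steps away may or may not be within d; farther cells never are.
        layer2 = members(cx, cy, 2, 3)
        if near + len(layer2) <= m:
            outliers.extend(own)  # the whole cell is anomalous
            continue
        for i in own:  # only here are pairwise distances needed
            close = np.sum(np.linalg.norm(points[layer2] - points[i], axis=1) <= d)
            if near + close <= m:
                outliers.append(i)
    return sorted(outliers)

# Step 1 (assumed data): 40 nodes running comparable tasks, 4 health features each.
features = 50 + 2 * np.sin(np.arange(40)[:, None] + np.arange(4)[None, :])
features[17] += 30  # hypothetical faulty node

# Steps 2 and 3: reduce to two dimensions, then flag nodes far from the majority.
flagged = cell_based_outliers(pca_reduce(features), d=10.0, m=5)
print(flagged)  # [17]
```

In this synthetic setup the healthy nodes differ by at most a few units per feature, so they all fall within `d` of one another, while the shifted node lies far from every other point and is the only one flagged. In practice `d` and `m` would have to be tuned to the collected features.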