FlowChecker: Detecting Bugs in MPI Libraries via Message Flow Checking
Proceedings of the 2010 ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis
NADO: network anomaly detection using outlier approach
Proceedings of the 2011 International Conference on Communication, Computing & Security
Event log mining tool for large scale HPC systems
Euro-Par'11 Proceedings of the 17th international conference on Parallel processing - Volume Part I
Framework for enabling system understanding
Euro-Par'11 Proceedings of the 2011 international conference on Parallel Processing - Volume 2
RainMon: an integrated approach to mining bursty timeseries monitoring data
Proceedings of the 18th ACM SIGKDD international conference on Knowledge discovery and data mining
3-Dimensional root cause diagnosis via co-analysis
Proceedings of the 9th international conference on Autonomic computing
Failure prediction based on log files using Random Indexing and Support Vector Machines
Journal of Systems and Software
Failure analysis of distributed scientific workflows executing in the cloud
Proceedings of the 8th International Conference on Network and Service Management
Cloud based intelligent system for delivering health care as a service
Computer Methods and Programs in Biomedicine
Hi-index | 0.00 |
When a system fails to function properly, health-related data are collected for troubleshooting. However, it is challenging to effectively identify anomalies from the voluminous amount of noisy, high-dimensional data. The traditional manual approach is time-consuming, error-prone, and even worse, not scalable. In this paper, we present an automated mechanism for node-level anomaly identification in large-scale systems. A set of techniques is presented to automatically analyze collected data: data transformation to construct a uniform data format for data analysis, feature extraction to reduce data size, and unsupervised learning to detect the nodes acting differently from others. Moreover, we compare two techniques, principal component analysis (PCA) and independent component analysis (ICA), for feature extraction. We evaluate our prototype implementation by injecting a variety of faults into a production system at NCSA. The results show that our mechanism, in particular, the one using ICA-based feature extraction, can effectively identify faulty nodes with high accuracy and low computation overhead.