Exploring event correlation for failure prediction in coalitions of clusters
Proceedings of the 2007 ACM/IEEE conference on Supercomputing
An analysis of clustered failures on large supercomputing systems
Journal of Parallel and Distributed Computing
CCGRID '09 Proceedings of the 2009 9th IEEE/ACM International Symposium on Cluster Computing and the Grid
A survey of online failure prediction methods
ACM Computing Surveys (CSUR)
Journal of Parallel and Distributed Computing
Towards pro-active adaptation with confidence: augmenting service monitoring with online testing
Proceedings of the 2010 ICSE Workshop on Software Engineering for Adaptive and Self-Managing Systems
Quantifying event correlations for proactive failure management in networked computing systems
Journal of Parallel and Distributed Computing
Error log processing for accurate failure prediction
WASL'08 Proceedings of the First USENIX conference on Analysis of system logs
Online event correlations analysis in system logs of large-scale cluster systems
NPC'10 Proceedings of the 2010 IFIP international conference on Network and parallel computing
Randomized load balancing strategies with churn resilience in peer-to-peer networks
Journal of Network and Computer Applications
Dependable Grid Workflow Scheduling Based on Resource Availability
Journal of Grid Computing
Hi-index | 0.00 |
Networked computing systems continue to grow in scale and in the complexity of their components and interactions. Component failures become norms instead of exceptions in these environments. Moreover, failure events exhibit strong correlations in time and space domain. In this paper, we develop a spherical covariance model with an adjustable timescale parameter to quantify the temporal correlation and a stochastic model to characterize spatial correlation. The models are further extended to take into account the information of application allocation to discover more correlations among failure instances. We cluster failure events based on their correlations and predict their future occurrences. Experimental results on a production coalition system, the Wayne State Grid, show the offline and online predictions by our predicting system can forecast 72.7% to 85.3% of the failure occurrences and capture failure correlations in cluster coalition environment.