Quantifying event correlations for proactive failure management in networked computing systems

Authors:
Song Fu; Cheng-Zhong Xu
Affiliations:
Department of Computer Science and Engineering, University of North Texas, United States; Department of Electrical and Computer Engineering, Wayne State University, United States
Venue:
Journal of Parallel and Distributed Computing
Year:
2010


Abstract

Networked computing systems continue to grow in scale and in the complexity of their components and interactions. In these environments, component failures become the norm rather than the exception. Moreover, failure events exhibit strong correlations in both the time and space domains. In this paper, we develop a spherical covariance model with an adjustable timescale parameter to quantify temporal correlation, and a stochastic model to characterize spatial correlation. The models are further extended to take application allocation information into account, which uncovers additional correlations among failure instances. We cluster failure events based on their correlations and predict their future occurrences. Experimental results on a production coalition system, the Wayne State Computational Grid, show that the offline and online predictions made by our prediction system forecast 72.7-85.3% of failure occurrences and capture failure correlations in a cluster coalition environment.
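
As a rough illustration of the temporal model named in the abstract, the sketch below evaluates the standard geostatistical spherical covariance function, with the range parameter playing the role of the adjustable timescale. The abstract does not give the paper's exact parameterization, so the coefficients (1.5 and 0.5), the function names `spherical_covariance` and `pairwise_correlation`, and the sample timestamps are all assumptions made for illustration.

```python
def spherical_covariance(d, theta, sill=1.0):
    """Standard spherical covariance at time lag d.

    theta is the timescale (range): events further apart than theta
    are treated as uncorrelated. sill is the value at lag zero.
    This is the textbook form, assumed here, not taken from the paper.
    """
    if theta <= 0:
        raise ValueError("theta must be positive")
    h = abs(d)
    if h >= theta:
        return 0.0
    r = h / theta
    return sill * (1.0 - 1.5 * r + 0.5 * r ** 3)


def pairwise_correlation(times, theta):
    """Matrix of spherical covariances between all pairs of event times."""
    return [[spherical_covariance(a - b, theta) for b in times] for a in times]


if __name__ == "__main__":
    # Hypothetical failure timestamps (hours), not data from the paper.
    failure_times = [0.0, 2.0, 3.0, 30.0]
    for row in pairwise_correlation(failure_times, theta=10.0):
        print(["%.3f" % value for value in row])
```

With a larger `theta`, failures that are further apart in time still receive a nonzero correlation, so adjusting the timescale changes which events end up grouped together.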