Exploring event correlation for failure prediction in coalitions of clusters

Authors: Song Fu; Cheng-Zhong Xu
Affiliations: Wayne State University, Detroit, MI; Wayne State University, Detroit, MI
Venue: Proceedings of the 2007 ACM/IEEE conference on Supercomputing
Year: 2007

Abstract

In large-scale networked computing systems, component failures become the norm rather than the exception. Failure prediction is a crucial technique for self-managing resource burdens. Failure events in coalition systems exhibit strong correlations in the time and space domains. In this paper, we develop a spherical covariance model with an adjustable timescale parameter to quantify the temporal correlation and a stochastic model to describe the spatial correlation. We further use application allocation information to discover more correlations among failure instances. We cluster failure events based on their correlations and predict their future occurrences. We implemented a failure prediction framework, called PREdictor of Failure Events Correlated Temporal-Spatially (hPREFECTs), which explores correlations among failures and forecasts the time-between-failure of future instances. We evaluate the performance of hPREFECTs in both offline failure prediction, using the Los Alamos HPC traces, and online prediction in an institute-wide coalition of clusters. Experimental results show that the system achieves more than 76% accuracy in offline prediction and more than 70% accuracy in online prediction from May 2006 to April 2007.
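
The abstract does not state the functional form of the spherical covariance model. As a rough illustration only, the sketch below assumes the standard spherical covariance function from geostatistics, normalized to 1 at zero lag, with `theta` standing in for the adjustable timescale parameter. The function name, the sample timestamps, and the 24-hour timescale are hypothetical and are not taken from the paper.

```python
def spherical_covariance(d: float, theta: float) -> float:
    """Standard spherical covariance, normalized so that C(0) = 1.

    d     -- time lag between two failure events (any consistent unit)
    theta -- timescale beyond which events are treated as uncorrelated
    """
    if theta <= 0:
        raise ValueError("theta must be positive")
    d = abs(d)
    if d >= theta:
        return 0.0  # lags at or beyond the timescale carry no correlation
    r = d / theta
    return 1.0 - 1.5 * r + 0.5 * r ** 3


# Hypothetical failure timestamps in hours (illustrative values only).
failure_times = [0.0, 2.0, 3.0, 40.0]
theta = 24.0  # assumed timescale; the paper treats this as adjustable

# Correlation of each event with the first one under the assumed model.
reference = failure_times[0]
weights = [spherical_covariance(t - reference, theta) for t in failure_times]
```

Under this assumed form, a larger `theta` makes events farther apart in time count as correlated, while a smaller `theta` restricts correlation to tight bursts of failures.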