Critical event prediction for proactive management in large-scale computer clusters

Authors:
R. K. Sahoo;A. J. Oliner;I. Rish;M. Gupta;J. E. Moreira;S. Ma;R. Vilalta;A. Sivasubramaniam
Affiliations:
IBM T. J. Watson Research Center, Yorktown Heights, NY;IBM T. J. Watson Research Center, Yorktown Heights, NY;IBM T. J. Watson Research Center, Yorktown Heights, NY;IBM T. J. Watson Research Center, Yorktown Heights, NY;IBM T. J. Watson Research Center, Yorktown Heights, NY;IBM T. J. Watson Research Center, Yorktown Heights, NY;University of Houston, Houston, TX;Pennsylvania State University, University Park, PA
Venue:
Proceedings of the ninth ACM SIGKDD international conference on Knowledge discovery and data mining
Year:
2003

Abstract

As the complexity of distributed computing systems increases, systems management tasks require significantly higher levels of automation; examples include diagnosis and prediction based on real-time streams of computer events, setting alarms, and performing continuous monitoring. The core of autonomic computing, a recently proposed initiative towards next-generation IT-systems capable of 'self-healing', is the ability to analyze data in real-time and to predict potential problems. The goal is to avoid catastrophic failures through prompt execution of remedial actions.

This paper describes an attempt to build a proactive prediction and control system for large clusters. We collected event logs containing various system reliability, availability and serviceability (RAS) events, and system activity reports (SARs) from a 350-node cluster system for a period of one year. The 'raw' system health measurements contain a great deal of redundant event data, which is either repetitive in nature or misaligned with respect to time. We applied a filtering technique and modeled the data into a set of primary and derived variables. These variables used probabilistic networks for establishing event correlations through prediction algorithms. We also evaluated the role of time-series methods, rule-based classification algorithms and Bayesian network models in event prediction.

Based on historical data, our results suggest that it is feasible to predict system performance parameters (SARs) with a high degree of accuracy using time-series models. Rule-based classification techniques can be used to extract machine-event signatures to predict critical events with up to 70% accuracy.
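
The abstract mentions filtering repetitive event data but does not say how the filter works. As a rough illustration only, the sketch below shows one common way such a filter could look: an event is dropped when the same node reported the same event type shortly before. The `Event` fields, the `filter_repeats` function, and the 300-second window are all hypothetical choices made for this example, not details taken from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    timestamp: float  # seconds since an arbitrary epoch (assumed format)
    node: str         # hypothetical node identifier
    kind: str         # hypothetical event type label


def filter_repeats(events, window=300.0):
    """Keep an event only if the same (node, kind) pair has not
    occurred within the previous `window` seconds.

    The gap is measured from the most recent occurrence, kept or not,
    so a continuous burst of repeats collapses to its first event.
    """
    last_seen = {}  # (node, kind) -> timestamp of the latest occurrence
    kept = []
    for ev in sorted(events, key=lambda e: e.timestamp):
        key = (ev.node, ev.kind)
        prev = last_seen.get(key)
        if prev is None or ev.timestamp - prev > window:
            kept.append(ev)
        last_seen[key] = ev.timestamp
    return kept


# Illustrative, made-up log entries (not data from the study).
sample = [
    Event(0.0, "n1", "disk_error"),
    Event(10.0, "n1", "disk_error"),   # repeat within the window, dropped
    Event(20.0, "n2", "disk_error"),   # different node, kept
    Event(400.0, "n1", "disk_error"),  # gap exceeds the window, kept
]
filtered = filter_repeats(sample)
```

Measuring the gap from the latest occurrence rather than from the last kept event is a design choice: it treats a long, uninterrupted burst as a single incident. A filter that re-emits an event every `window` seconds during a burst would update `last_seen` only when an event is kept.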