As the complexity of distributed computing systems increases, systems management tasks require significantly higher levels of automation; examples include diagnosis and prediction based on real-time streams of computer events, setting alarms, and performing continuous monitoring. The core of autonomic computing, a recently proposed initiative towards next-generation IT systems capable of 'self-healing', is the ability to analyze data in real time and to predict potential problems. The goal is to avoid catastrophic failures through prompt execution of remedial actions.

This paper describes an attempt to build a proactive prediction and control system for large clusters. We collected event logs containing various system reliability, availability and serviceability (RAS) events, and system activity reports (SARs) from a 350-node cluster system for a period of one year. The 'raw' system health measurements contain a great deal of redundant event data, which is either repetitive in nature or misaligned with respect to time. We applied a filtering technique and modeled the data into a set of primary and derived variables. These variables were used in probabilistic networks to establish event correlations through prediction algorithms. We also evaluated the role of time-series methods, rule-based classification algorithms and Bayesian network models in event prediction.

Based on historical data, our results suggest that it is feasible to predict system performance parameters (SARs) with a high degree of accuracy using time-series models. Rule-based classification techniques can be used to extract machine-event signatures to predict critical events with up to 70% accuracy.
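The abstract does not specify the filtering technique applied to the repetitive RAS events. As a minimal sketch of one common approach, assuming that a repeated event is one with the same node and event type arriving within a fixed time window of the previous occurrence, the idea could look like the following. The `Event` fields, the `filter_repeats` name, and the 300-second default window are all hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List, Tuple


@dataclass(frozen=True)
class Event:
    timestamp: float  # seconds since an arbitrary epoch (assumed unit)
    node: str         # hypothetical node identifier
    kind: str         # hypothetical event-type label


def filter_repeats(events: Iterable[Event], window: float = 300.0) -> List[Event]:
    """Drop events that repeat the same (node, kind) within `window` seconds.

    Illustrative only: the window-based rule is an assumption, not the
    filtering method used in the paper.
    """
    last_seen: Dict[Tuple[str, str], float] = {}
    kept: List[Event] = []
    # Sort first so that logs with misordered timestamps are handled consistently.
    for ev in sorted(events, key=lambda e: e.timestamp):
        key = (ev.node, ev.kind)
        prev = last_seen.get(key)
        if prev is None or ev.timestamp - prev > window:
            kept.append(ev)
        # Refresh on every occurrence so a continuous burst collapses to its first event.
        last_seen[key] = ev.timestamp
    return kept


if __name__ == "__main__":
    log = [
        Event(0.0, "node-1", "disk_error"),
        Event(10.0, "node-1", "disk_error"),
        Event(5.0, "node-2", "disk_error"),
        Event(20.0, "node-1", "disk_error"),
        Event(400.0, "node-1", "disk_error"),
    ]
    for ev in filter_repeats(log):
        print(ev)
```

Refreshing the last-seen time on every occurrence means a long burst is kept as a single event; a fixed window anchored at the first occurrence would be an equally plausible choice and would keep periodic markers within long bursts.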