Adaptive event prediction strategy with dynamic time window for large-scale HPC systems

Authors:
Ana Gainaru; Franck Cappello; Joshi Fullop; Stefan Trausan-Matu; William Kramer
Affiliations:
UIUC, NCSA, Urbana, IL and UPB, Bucharest, Romania; INRIA, France and UIUC, Urbana, IL; UIUC, NCSA, Urbana, IL; UPB, Bucharest, Romania; UIUC, NCSA, Urbana, IL
Venue:
SLAML '11 Managing Large-scale Systems via the Analysis of System Logs and the Application of Machine Learning Techniques
Year:
2011

Abstract

In this paper, we analyse messages generated by different large-scale HPC systems in order to extract sequences of correlated events, which we later use to predict both the normal and the faulty behaviour of the system. Our method uses a dynamic window strategy that is able to find frequent sequences of events regardless of the time delay between them. Most current related research restricts correlation extraction to fixed and relatively small time windows that do not reflect the whole behaviour of the system. The events a machine generates change constantly over its lifetime. We therefore consider it important to update the sequences at runtime, applying modifications after each prediction phase according to the forecast's accuracy and the difference between what was expected and what really happened. Our experiments show that our analysing system is able to predict around 60% of events with a precision of around 85%, at a lower event granularity than before.
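
As a reading aid for the two figures quoted above, the short sketch below shows how the share of events predicted (recall) and the precision are conventionally computed from prediction counts. It is not taken from the paper: the function name `precision_recall` and the counts are hypothetical, chosen only so that the results land near the reported 60% and 85%.

```python
def precision_recall(true_positives, false_positives, false_negatives):
    """Return (precision, recall) computed from raw prediction counts."""
    predicted = true_positives + false_positives   # all predictions issued
    actual = true_positives + false_negatives      # all events that occurred
    precision = true_positives / predicted if predicted else 0.0
    recall = true_positives / actual if actual else 0.0
    return precision, recall


# Hypothetical counts (assumption, not data from the paper):
# 100 events occurred, 60 were predicted, and 11 predictions were false alarms.
precision, recall = precision_recall(
    true_positives=60, false_positives=11, false_negatives=40
)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Under these assumed counts, recall is the fraction of real events that were forecast, while precision is the fraction of issued forecasts that turned out to be correct; a predictor can score well on one and poorly on the other, which is presumably why the abstract reports both.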