In this paper, we analyse messages generated by different large-scale HPC systems in order to extract sequences of correlated events, which we later use to predict the normal and faulty behaviour of the system. Our method uses a dynamic window strategy that is able to find frequent sequences of events regardless of the time delay between them. Most current related research restricts correlation extraction to fixed and relatively small time windows that do not reflect the whole behaviour of the system. Moreover, the generated events change constantly during the lifetime of the machine. We therefore consider it important to update the sequences at runtime, applying modifications after each prediction phase according to the forecast's accuracy and the difference between what was expected and what really happened. Our experiments show that our analysis system is able to predict around 60% of events with a precision of around 85%, at a lower event granularity than before.
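The two closing figures can be read as recall (the share of events that were predicted) and precision (the share of predictions that were correct). The sketch below illustrates these standard definitions only; it is not the evaluation code used in the paper. The function name `precision_recall`, the use of plain integer event identifiers, and the sample data are assumptions made for illustration.

```python
def precision_recall(predicted, actual):
    """Return (precision, recall) for two collections of event identifiers.

    Both arguments are hypothetical: any hashable identifiers will do.
    """
    predicted, actual = set(predicted), set(actual)
    true_positives = len(predicted & actual)  # predicted events that occurred
    # Precision: fraction of predictions that matched a real event.
    precision = true_positives / len(predicted) if predicted else 0.0
    # Recall: fraction of real events that were predicted.
    recall = true_positives / len(actual) if actual else 0.0
    return precision, recall


# Illustrative data only: 100 real events, 70 predictions, 60 of them correct.
actual_events = range(100)
predicted_events = list(range(60)) + list(range(1000, 1010))
precision, recall = precision_recall(predicted_events, actual_events)
```

With this made-up data, 60 of 100 events are predicted and 60 of 70 predictions are correct, which is roughly the regime the abstract reports.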