Failure prediction for HPC systems and applications: Current situation and open issues

  • Authors:
Ana Gainaru; Franck Cappello; Marc Snir; William Kramer

  • Affiliations:
Ana Gainaru: National Centre for Supercomputing Applications, Urbana, IL, USA; University of Illinois at Urbana-Champaign, Urbana, IL, USA. Franck Cappello: University of Illinois at Urbana-Champaign, Urbana, IL, USA; INRIA, Rocquencourt, Le Chesnay Cedex, France. Marc Snir: University of Illinois at Urbana-Champaign, Urbana, IL, USA; Argonne National Laboratory, Argonne, IL, USA. William Kramer: National Centre for Supercomputing Applications, Urbana, IL, USA.

  • Venue:
  • International Journal of High Performance Computing Applications
  • Year:
  • 2013

Abstract

As large-scale systems evolve towards post-petascale computing, it is crucial to provide fault-tolerance strategies that minimize the effects of faults on applications. By far the most popular technique is the checkpoint-restart strategy. A complement to this classical approach is failure avoidance, in which the occurrence of a fault is predicted and proactive measures are taken. This requires a reliable prediction system to anticipate failures and their locations. One way of offering prediction is through the analysis of the system logs that large-scale systems generate during production. Current research in this field suffers from a number of limitations that make existing methods unusable on real production high-performance computing (HPC) systems. Based on our observation that different failures have different distributions and behaviours, we propose a novel hybrid approach that combines signal analysis with data mining in order to overcome these limitations. We show that, by analysing each event according to its specific behaviour, our prediction achieves a precision of over 90% and is able to discover about 50% of all failures in a system, a result that allows its integration into proactive fault-tolerance protocols.
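The two figures of merit quoted in the abstract are the standard precision and recall of a failure predictor. As a minimal illustration (the event sets below are invented, not taken from the paper's data), assuming predicted and actual failures are represented as sets of event identifiers:

```python
# Hypothetical sketch of the precision/recall metrics cited in the abstract.
# Precision: fraction of raised alarms that matched a real failure.
# Recall: fraction of real failures that were predicted in advance.

def precision_recall(predicted, actual):
    true_positives = len(predicted & actual)  # alarms matching real failures
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(actual) if actual else 0.0
    return precision, recall

# Invented example: 10 failures occurred; the predictor raised 6 alarms,
# 5 of which were correct.
actual = set(range(10))
predicted = {0, 1, 2, 3, 4, 99}
p, r = precision_recall(predicted, actual)
# Here p = 5/6 and r = 0.5; a recall of 0.5 mirrors the abstract's claim of
# discovering about 50% of failures, while precision tracks false alarms.
```

In a proactive fault-tolerance setting, high precision matters because every alarm triggers a costly action (e.g. an extra checkpoint or a process migration), while recall bounds how many failures can be avoided at all.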