Fault prediction under the microscope: a closer look into HPC systems

Authors:
Ana Gainaru;Franck Cappello;Marc Snir;William Kramer
Affiliations:
UIUC, Urbana, IL;INRIA, France;MCS, ANL, IL;NCSA, UIUC, Urbana, IL
Venue:
SC '12 Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis
Year:
2012

Citing 23
Cited 1

Automatic Recognition of Intermittent Failures: An Experimental Study of Field Data

IEEE Transactions on Computers
Critical event prediction for proactive management in large-scale computer clusters

Proceedings of the ninth ACM SIGKDD international conference on Knowledge discovery and data mining
FTC-Charm++: an in-memory checkpoint-based fault tolerant runtime for Charm++ and MPI

CLUSTER '04 Proceedings of the 2004 IEEE International Conference on Cluster Computing
A large-scale study of failures in high-performance computing systems

DSN '06 Proceedings of the International Conference on Dependable Systems and Networks
BlueGene/L Failure Analysis and Prediction Models

DSN '06 Proceedings of the International Conference on Dependable Systems and Networks
What Supercomputers Say: A Study of Five System Logs

DSN '07 Proceedings of the 37th Annual IEEE/IFIP International Conference on Dependable Systems and Networks
Proactive process-level live migration in HPC environments

Proceedings of the 2008 ACM/IEEE conference on Supercomputing
Dynamic Meta-Learning for Failure Prediction in Large-Scale Systems: A Case Study

ICPP '08 Proceedings of the 2008 37th International Conference on Parallel Processing
Mining Frequent Gradual Itemsets from Large Databases

IDA '09 Proceedings of the 8th International Symposium on Intelligent Data Analysis: Advances in Intelligent Data Analysis VIII
Online System Problem Detection by Mining Patterns of Console Logs

ICDM '09 Proceedings of the 2009 Ninth IEEE International Conference on Data Mining
Mining dependency in distributed systems through unstructured logs analysis

ACM SIGOPS Operating Systems Review
Design, Modeling, and Evaluation of a Scalable Multi-level Checkpointing System

Proceedings of the 2010 ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis
Checkpointing vs. Migration for Post-Petascale Supercomputers

ICPP '10 Proceedings of the 2010 39th International Conference on Parallel Processing
A practical failure prediction with location and lead time for Blue Gene/P

DSNW '10 Proceedings of the 2010 International Conference on Dependable Systems and Networks Workshops (DSN-W)
Rebound: scalable checkpointing for coherent shared memory

Proceedings of the 38th annual international symposium on Computer architecture
Event log mining tool for large scale HPC systems

Euro-Par'11 Proceedings of the 17th international conference on Parallel processing - Volume Part I
Adaptive event prediction strategy with dynamic time window for large-scale HPC systems

SLAML '11 Managing Large-scale Systems via the Analysis of System Logs and the Application of Machine Learning Techniques
Improving Log-based Field Failure Data Analysis of multi-node computing systems

DSN '11 Proceedings of the 2011 IEEE/IFIP 41st International Conference on Dependable Systems&Networks
Predicting Node Failure in High Performance Computing Systems from Failure and Usage Logs

IPDPSW '11 Proceedings of the 2011 IEEE International Symposium on Parallel and Distributed Processing Workshops and PhD Forum
FTI: high performance fault tolerance interface for hybrid systems

Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis
Modeling and tolerating heterogeneous failures in large parallel systems

Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis
PGP-mc: towards a multicore parallel approach for mining gradual patterns

DASFAA'10 Proceedings of the 15th international conference on Database Systems for Advanced Applications - Volume Part I
Taming of the Shrew: Modeling the Normal and Faulty Behaviour of Large-scale HPC Systems

IPDPS '12 Proceedings of the 2012 IEEE 26th International Parallel and Distributed Processing Symposium

Checkpointing algorithms and fault prediction

Journal of Parallel and Distributed Computing

Quantified Score

Hi-index	0.00

Visualization

Abstract

A large percentage of computing capacity in today's large high-performance computing systems is wasted because of failures. Consequently current research is focusing on providing fault tolerance strategies that aim to minimize fault's effects on applications. By far the most popular technique is the checkpoint-restart strategy. A complement to this classical approach is failure avoidance, by which the occurrence of a fault is predicted and preventive measures are taken. This requires a reliable prediction system to anticipate failures and their locations. Thus far, research in this field has used ideal predictors that were not implemented in real HPC systems. In this paper, we merge signal analysis concepts with data mining techniques to extend the ELSA (Event Log Signal Analyzer) toolkit and offer an adaptive and more efficient prediction module. Our goal is to provide models that characterize the normal behavior of a system and the way faults affect it. Being able to detect deviations from normality quickly is the foundation of accurate fault prediction. However, this is challenging because component failure dynamics are heterogeneous in space and time. To this end, a large part of the paper is focused on a detailed analysis of the prediction method, by applying it to two large-scale systems and by investigating the characteristics and bottlenecks of each step of the prediction process. Furthermore, we analyze the prediction's precision and recall impact on current checkpointing strategies and highlight future improvements and directions for research in this field.