BlueGene/L Failure Analysis and Prediction Models

Authors:
Yinglung Liang;Yanyong Zhang;Anand Sivasubramaniam;Morris Jette;Ramendra Sahoo
Affiliations:
Rutgers University;Rutgers University;Penn State University;Lawrence Livermore National Laboratory;IBM T. J. Watson Research Center
Venue:
DSN '06 Proceedings of the International Conference on Dependable Systems and Networks
Year:
2006

Citing 0
Cited 24

Cooperative checkpointing: a robust approach to large-scale systems reliability
Proceedings of the 20th annual international conference on Supercomputing

Exploring event correlation for failure prediction in coalitions of clusters
Proceedings of the 2007 ACM/IEEE conference on Supercomputing

An analysis of clustered failures on large supercomputing systems
Journal of Parallel and Distributed Computing

Methodologies for advance warning of compute cluster problems via statistical analysis: a case study
Proceedings of the 2009 workshop on Resiliency in high performance

Fault Tolerance in Petascale/Exascale Systems: Current Knowledge, Challenges and Research Opportunities
International Journal of High Performance Computing Applications

Failure-Aware Construction and Reconfiguration of Distributed Virtual Machines for High Availability Computing
CCGRID '09 Proceedings of the 2009 9th IEEE/ACM International Symposium on Cluster Computing and the Grid

Toward Exascale Resilience
International Journal of High Performance Computing Applications

A survey of online failure prediction methods
ACM Computing Surveys (CSUR)

Diagnosis of recurrent faults using log files
CASCON '09 Proceedings of the 2009 Conference of the Center for Advanced Studies on Collaborative Research

DAGMap: efficient and dependable scheduling of DAG workflow job in Grid
The Journal of Supercomputing

Failure-aware resource management for high-availability computing clusters with distributed virtual machines
Journal of Parallel and Distributed Computing

A study of dynamic meta-learning for failure prediction in large-scale systems
Journal of Parallel and Distributed Computing

Towards pro-active adaptation with confidence: augmenting service monitoring with online testing
Proceedings of the 2010 ICSE Workshop on Software Engineering for Adaptive and Self-Managing Systems

End-to-end framework for fault management for open source clusters: Ranger
Proceedings of the 2010 TeraGrid Conference

Quantifying event correlations for proactive failure management in networked computing systems
Journal of Parallel and Distributed Computing

SALSA: analyzing logs as state machines
WASL'08 Proceedings of the First USENIX conference on Analysis of system logs

Online event correlations analysis in system logs of large-scale cluster systems
NPC'10 Proceedings of the 2010 IFIP international conference on Network and parallel computing

Evaluating cooperative checkpointing for supercomputing systems
IPDPS'06 Proceedings of the 20th international conference on Parallel and distributed processing

Symptom-based problem determination using log data abstraction
Proceedings of the 2010 Conference of the Center for Advanced Studies on Collaborative Research

Adaptive event prediction strategy with dynamic time window for large-scale HPC systems
SLAML '11 Managing Large-scale Systems via the Analysis of System Logs and the Application of Machine Learning Techniques

Fault prediction under the microscope: a closer look into HPC systems
SC '12 Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis

A decentralized approach for mining event correlations in distributed system monitoring
Journal of Parallel and Distributed Computing

A survey of fault tolerance mechanisms and checkpoint/restart implementations for high performance computing systems
The Journal of Supercomputing

Failure prediction for HPC systems and applications: Current situation and open issues
International Journal of High Performance Computing Applications

Abstract

The growing computational and storage needs of several scientific applications mandate the deployment of extreme-scale parallel machines, such as IBM's BlueGene/L, which can accommodate as many as 128K processors. One of the challenges when designing and deploying these systems in a production setting is the need to take failure occurrences, whether in hardware or in software, into account. Earlier work has shown that conventional runtime fault-tolerance techniques such as periodic checkpointing are not effective for these emerging systems. Instead, the ability to predict failure occurrences can help develop more effective checkpointing strategies. Failure prediction has long been regarded as a challenging research problem, mainly due to the lack of realistic failure data from actual production systems. In this study, we have collected RAS event logs from BlueGene/L over a period of more than 100 days. We have investigated the characteristics of fatal failure events, as well as the correlation between fatal events and non-fatal events. Based on the observations, we have developed three simple yet effective failure prediction methods, which can predict around 80% of the memory and network failures, and 47% of the application I/O failures.