The growing computational and storage needs of several scientific applications mandate the deployment of extreme-scale parallel machines, such as IBM's BlueGene/L, which can accommodate as many as 128K processors. In this paper, we present our experiences in collecting and filtering error event logs from an 8192-processor BlueGene/L prototype at IBM Rochester, which is currently ranked #8 in the Top500 list. We analyze the logs collected from this machine over a period of 84 days starting from August 26, 2004. We apply a three-step filtering algorithm to these logs: extracting and categorizing failure events; temporal filtering to remove duplicate reports from the same location; and finally coalescing failure reports of the same error across different locations. Using this approach, we can substantially compress these logs, removing over 99.96% of the 828,387 original entries, and more accurately portray the failure occurrences on this system.
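The two filtering stages after event extraction can be sketched in Python. This is a minimal illustration, not the authors' implementation: the `Event` record layout, the threshold values, and the rule of keeping an event only when it falls outside a time window of the last kept report are all assumptions made for the sketch.

```python
from dataclasses import dataclass

@dataclass
class Event:
    time: float      # report timestamp, in seconds
    location: str    # e.g. a node or midplane identifier
    category: str    # failure category assigned during extraction

def temporal_filter(events, window):
    """Step 2 (sketch): drop duplicate reports of the same category from
    the same location arriving within `window` seconds of the last kept one."""
    last_kept = {}   # (location, category) -> time of last kept report
    kept = []
    for e in sorted(events, key=lambda e: e.time):
        key = (e.location, e.category)
        if key not in last_kept or e.time - last_kept[key] > window:
            kept.append(e)
            last_kept[key] = e.time
    return kept

def spatial_coalesce(events, window):
    """Step 3 (sketch): coalesce reports of the same category from
    different locations within `window` seconds into one failure record."""
    last_kept = {}   # category -> time of last kept report
    kept = []
    for e in sorted(events, key=lambda e: e.time):
        if e.category not in last_kept or e.time - last_kept[e.category] > window:
            kept.append(e)
            last_kept[e.category] = e.time
    return kept
```

For example, a burst of network-error reports from one node collapses to a single record in the temporal step, and near-simultaneous reports of the same error from neighboring nodes collapse further in the spatial step; the window sizes govern how aggressive the compression is.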