An analysis of clustered failures on large supercomputing systems

  • Authors:
  • Thomas J. Hacker, Fabian Romero, Christopher D. Carothers

  • Affiliations:
  • Thomas J. Hacker: Computer & Information Technology, Purdue University, 401 North Grant Street, West Lafayette, IN 47907, USA; Discovery Park Cyber Center, Purdue University, West Lafayette, IN 47907, USA
  • Fabian Romero: Computer & Information Technology, Purdue University, 401 North Grant Street, West Lafayette, IN 47907, USA
  • Christopher D. Carothers: Department of Computer Science, Rensselaer Polytechnic Institute, 110 8th Street, Troy, NY 12180, USA

  • Venue:
  • Journal of Parallel and Distributed Computing
  • Year:
  • 2009

Abstract

Large supercomputers are built today from thousands of commodity components and suffer from poor reliability due to frequent component failures. The failure characteristics observed on large-scale systems differ from those of the smaller-scale systems studied in the past. One striking difference is that system events are clustered temporally and spatially, which complicates failure analysis and application design. Developing a clear understanding of failures in large-scale systems is a critical step toward building more reliable systems and applications that can better tolerate and recover from failures. In this paper, we analyze the event logs of two large IBM Blue Gene systems, statistically characterize system failures, present a model for predicting the probability of node failure, and assess the effects of differing failure rates on job failures for large-scale systems. The work presented in this paper will be useful for developers and designers seeking to deploy efficient and reliable petascale systems.
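
To make the modeling idea concrete, here is a minimal, hypothetical sketch of one common approach to this kind of analysis: fitting a Weibull distribution to the inter-arrival times of failures extracted from an event log, then using the fitted distribution to estimate the probability of a failure within a given window. This is an illustrative assumption, not the paper's actual model; the timestamps and variable names below are invented.

```python
# Illustrative sketch only -- NOT the authors' model. Assumes failure
# timestamps (hours since log start) have already been extracted from
# the event log; the values here are hypothetical.

import numpy as np
from scipy.stats import weibull_min

failure_times = np.array([12.0, 40.5, 41.0, 41.2, 95.3, 96.1, 170.0])

# Inter-arrival times between consecutive failures; temporal clustering
# shows up as runs of short gaps mixed with long quiet periods.
gaps = np.diff(np.sort(failure_times))

# Fit a two-parameter Weibull (location fixed at 0). A shape parameter
# c < 1 indicates a decreasing hazard rate, i.e. failures tend to
# arrive in bursts, consistent with temporal clustering.
c, loc, scale = weibull_min.fit(gaps, floc=0)

# Probability that the next failure occurs within `window` hours.
window = 24.0
p_fail = weibull_min.cdf(window, c, loc=loc, scale=scale)

print(f"shape={c:.3f}, scale={scale:.3f} h")
print(f"P(failure within {window:.0f} h) = {p_fail:.3f}")
```

A fitted distribution of this kind can feed directly into the sort of question the abstract raises: given a job spanning many nodes for many hours, how likely is it to encounter at least one node failure before completion.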