Today's large supercomputers are built from thousands of commodity components and suffer from poor reliability because of frequent component failures. The failure characteristics observed on large-scale systems differ from those of the smaller-scale systems studied in the past. One striking difference is that system events are clustered temporally and spatially, which complicates failure analysis and application design. Developing a clear understanding of failures in large-scale systems is a critical step toward building more reliable systems and applications that can better tolerate and recover from failures. In this paper, we analyze the event logs of two large IBM Blue Gene systems, statistically characterize system failures, present a model for predicting the probability of node failure, and assess how differing failure rates affect job failures on large-scale systems. The work presented in this paper will be useful for developers and designers seeking to deploy efficient and reliable petascale systems.