Failure Data Analysis of a Large-Scale Heterogeneous Server Environment

Authors:
Ramendra K. Sahoo; Anand Sivasubramaniam; Mark S. Squillante; Yanyong Zhang
Affiliations:
IBM Thomas J. Watson Research Center, Yorktown Heights, NY; Pennsylvania State University, University Park, PA; IBM Thomas J. Watson Research Center, Yorktown Heights, NY; Rutgers University, Piscataway, NJ
Venue:
DSN '04 Proceedings of the 2004 International Conference on Dependable Systems and Networks
Year:
2004

Abstract

The growing complexity of hardware and software mandates the recognition of fault occurrence in system deployment and management. While there are several techniques to prevent and/or handle faults, there continues to be a growing need for an in-depth understanding of system errors and failures and their empirical and statistical properties. This understanding can help evaluate the effectiveness of different techniques for improving system availability, in addition to developing new solutions. In this paper, we analyze the empirical and statistical properties of system errors and failures from a network of nearly 400 heterogeneous servers running a diverse workload over a year. While improvements in system robustness continue to limit the number of actual failures to a very small fraction of the recorded errors, the failure rates are significant and highly variable. Our results also show that the system error and failure patterns are comprised of time-varying behavior containing long stationary intervals. These stationary intervals exhibit various strong correlation structures and periodic patterns, which impact performance but also can be exploited to address such performance issues.