The growing complexity of hardware and software mandates the recognition of fault occurrence in system deployment and management. While there are several techniques to prevent and/or handle faults, there continues to be a growing need for an in-depth understanding of system errors and failures and their empirical and statistical properties. This understanding can help evaluate the effectiveness of different techniques for improving system availability, in addition to developing new solutions. In this paper, we analyze the empirical and statistical properties of system errors and failures from a network of nearly 400 heterogeneous servers running a diverse workload over a year. While improvements in system robustness continue to limit the number of actual failures to a very small fraction of the recorded errors, the failure rates are significant and highly variable. Our results also show that the system error and failure patterns are comprised of time-varying behavior containing long stationary intervals. These stationary intervals exhibit various strong correlation structures and periodic patterns, which impact performance but also can be exploited to address such performance issues.