A model for space-correlated failures in large-scale distributed systems

Authors:
Matthieu Gallet;Nezih Yigitbasi;Bahman Javadi;Derrick Kondo;Alexandru Iosup;Dick Epema
Affiliations:
Delft University of Technology, The Netherlands and The Failure Trace Archive;Delft University of Technology, The Netherlands and The Failure Trace Archive;INRIA Grenoble, France and The Failure Trace Archive;INRIA Grenoble, France and The Failure Trace Archive;Delft University of Technology, The Netherlands and The Failure Trace Archive;Delft University of Technology, The Netherlands and The Failure Trace Archive
Venue:
EuroPar'10: Proceedings of the 16th International Euro-Par Conference on Parallel Processing: Part I
Year:
2010

Abstract

Distributed systems such as grids, peer-to-peer systems, and even Internet DNS servers have grown significantly in size and complexity in the last decade. This rapid growth has allowed distributed systems to serve a large and increasing number of users, but has also made resource and system failures inevitable. Moreover, perhaps as a result of system complexity, a single failure in a distributed system can trigger several more failures within a short time span, forming a group of time-correlated failures. To eliminate or alleviate the significant effects of failures on performance and functionality, the techniques for dealing with failures require good failure models. However, few such models are available, and those that exist are valid for only a few distributed systems, or even just one. In contrast, in this work we propose a model that considers groups of time-correlated failures and is valid for many types of distributed systems. Our model includes three components: the group size, the group inter-arrival time, and the resource downtime caused by the group. To validate this model, we use failure traces corresponding to fifteen distributed systems. We find that space-correlated failures are dominant in terms of resource downtime in seven of the fifteen studied systems. For each of these seven systems, we provide a set of model parameters that can be used in research studies or for tuning distributed systems. Finally, as a result of our work, six of the studied traces have been made available through the Failure Trace Archive (http://fta.inria.fr).
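
To make the three model components concrete, the following minimal Python sketch extracts group size, group inter-arrival time, and group downtime from a list of failure events. It is an illustration only, not the authors' method: the abstract does not specify how groups are formed, so the grouping rule used here (a failure joins the current group if it starts within `window` seconds of the previous failure) is an assumption, as are the names `FailureGroup`, `group_failures`, and `inter_arrival_times` and the `(start, end)` event format.

```python
from dataclasses import dataclass
from typing import List, Tuple

# One unavailability interval of a resource: (start_time, end_time) in seconds.
# This event format is hypothetical, chosen only for the illustration.
FailureEvent = Tuple[float, float]


@dataclass
class FailureGroup:
    start: float     # start time of the first failure in the group
    size: int        # number of failures in the group
    downtime: float  # sum of the downtimes of all failures in the group


def group_failures(events: List[FailureEvent], window: float) -> List[FailureGroup]:
    """Group failures whose start times are close together.

    Assumed rule: a failure joins the current group if it starts at most
    `window` seconds after the start of the previous failure.
    """
    groups: List[FailureGroup] = []
    prev_start = None
    for start, end in sorted(events):
        if prev_start is not None and start - prev_start <= window:
            current = groups[-1]
            current.size += 1
            current.downtime += end - start
        else:
            groups.append(FailureGroup(start=start, size=1, downtime=end - start))
        prev_start = start
    return groups


def inter_arrival_times(groups: List[FailureGroup]) -> List[float]:
    """Time between the starts of consecutive groups."""
    return [b.start - a.start for a, b in zip(groups, groups[1:])]


# Small made-up trace, for illustration only.
events = [(0, 10), (5, 20), (100, 130), (400, 410), (402, 405)]
groups = group_failures(events, window=60)
sizes = [g.size for g in groups]
downtimes = [g.downtime for g in groups]
gaps = inter_arrival_times(groups)
```

The resulting `sizes`, `gaps`, and `downtimes` are the empirical samples to which a distribution could then be fitted for each component; which distributions fit best is a result of the paper and is not reproduced here.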