Faults in large distributed systems and what we can do about them

Authors:
George Kola; Tevfik Kosar; Miron Livny
Affiliations:
Computer Sciences Department, University of Wisconsin-Madison, Madison, WI (all authors)
Venue:
Euro-Par'05 Proceedings of the 11th international Euro-Par conference on Parallel Processing
Year:
2005

Abstract

Scientists increasingly use large distributed systems built from commodity off-the-shelf components to perform scientific computation. Grid computing has expanded the scale of such systems by spanning them across organizations. While such systems are cost-effective, the use of a large number of commodity components causes high fault and failure rates. Some of these faults result in silent data corruption, leaving users with possibly incorrect results. In this work, we analyzed the faults and failures that occurred in Condor pools at UW-Madison with a few thousand CPUs and in two large distributed applications, US-CMS and BMRB BLAST, each of which used hundreds of thousands of CPU hours. We propose the ‘silent-fail-stutter’ fault model to correctly model silent failures and detail how to handle them. Based on the model, we have designed mechanisms that automatically detect and handle silent failures and ensure that users get correct results. Our mechanisms perform automated fault location and can transparently adapt applications to avoid faulty machines. We also designed a data provenance mechanism that tracks the origin of results, enabling scientists to selectively purge results from faulty components.
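
The data provenance idea in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the class `ProvenanceLog`, its methods, and the machine and result names are all hypothetical, and the sketch assumes that a result counts as suspect if it was produced on a faulty machine or derived from a suspect input.

```python
from collections import defaultdict


class ProvenanceLog:
    """Hypothetical record of which machine produced each result."""

    def __init__(self):
        self.machine_of = {}                # result id -> machine that produced it
        self.parents_of = defaultdict(set)  # result id -> ids of the inputs it was derived from

    def record(self, result_id, machine, inputs=()):
        self.machine_of[result_id] = machine
        self.parents_of[result_id].update(inputs)

    def tainted(self, faulty_machines):
        """Return ids of results produced on a faulty machine, directly or via their inputs."""
        bad = {r for r, m in self.machine_of.items() if m in faulty_machines}
        changed = True
        while changed:  # propagate taint to derived results until nothing changes
            changed = False
            for r, parents in self.parents_of.items():
                if r not in bad and parents & bad:
                    bad.add(r)
                    changed = True
        return bad

    def purge(self, results, faulty_machines):
        """Return a copy of `results` without the tainted entries."""
        bad = self.tainted(faulty_machines)
        return {r: v for r, v in results.items() if r not in bad}


# Example: r3 was computed on a healthy node but from r2, which came from a faulty one.
log = ProvenanceLog()
log.record("r1", "node-a")
log.record("r2", "node-b")
log.record("r3", "node-a", inputs=["r2"])
results = {"r1": 10, "r2": 20, "r3": 30}
clean = log.purge(results, faulty_machines={"node-b"})
print(sorted(clean))  # prints ['r1']
```

Tracking inputs as well as machines is a design choice of this sketch: it lets a purge remove only the results that depend on a faulty component, rather than discarding an entire run.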