Fault Tolerance and Recovery of Scientific Workflows on Computational Grids

Authors:
Gopi Kandaswamy;Anirban Mandal;Daniel A. Reed
Venue:
CCGRID '08 Proceedings of the 2008 Eighth IEEE International Symposium on Cluster Computing and the Grid
Year:
2008

Citing: 0
Cited by: 9

Analysis of application heartbeats: learning structural and temporal features in time series data for identification of performance problems
Proceedings of the 2008 ACM/IEEE conference on Supercomputing

Combined Fault Tolerance and Scheduling Techniques for Workflow Applications on Computational Grids
CCGRID '09 Proceedings of the 2009 9th IEEE/ACM International Symposium on Cluster Computing and the Grid

VGrADS: enabling e-Science workflows on grids and clouds with fault tolerance
Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis

Experience with adapting a WS-BPEL runtime for eScience workflows
Proceedings of the 5th Grid Computing Environments Workshop

ART: adaptive, reliable, and fault-tolerant task management for computational grids
Proceedings of the 2010 ACM Symposium on Applied Computing

Localising temporal constraints in scientific workflows
Journal of Computer and System Sciences

Scheduling scientific workflows to meet soft deadlines in the absence of failure models
EuroPar'10 Proceedings of the 16th international Euro-Par conference on Parallel processing: Part I

Characterizing quality of resilience in scientific workflows
Proceedings of the 6th workshop on Workflows in support of large-scale science

Performance evaluation of cloud service considering fault recovery
The Journal of Supercomputing

Abstract

In this paper, we describe the design and implementation of two mechanisms for fault tolerance and recovery for complex scientific workflows on computational grids. We present our algorithms for over-provisioning and migration, which are our primary fault-tolerance strategies. To determine the appropriate strategy, we consider application performance models, resource reliability models, network latency and bandwidth, and queue wait times for batch queues on compute resources. Our goal is to balance reliability and performance in the presence of soft real-time constraints, such as deadlines and expected success probabilities, in a way that is transparent to scientists. We evaluated our strategies by developing a Fault-Tolerance and Recovery (FTR) service and deploying it as part of the Linked Environments for Atmospheric Discovery (LEAD) production infrastructure. Results from real usage scenarios in LEAD show that our fault-tolerance strategies reduce the failure rate of individual workflow steps from about 30% to 5%.
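
The abstract does not spell out the over-provisioning algorithm, so the following is only a minimal sketch of how such a decision could combine the inputs it lists. It assumes that replica failures are independent and that a step's completion time is the sum of queue wait, data transfer time, and predicted run time. The names `Resource`, `success_probability`, and `plan_over_provisioning` are hypothetical and do not come from the FTR service.

```python
from dataclasses import dataclass
from math import prod


@dataclass(frozen=True)
class Resource:
    """Hypothetical per-resource estimates for one workflow step."""
    name: str
    run_time: float       # predicted execution time in seconds (performance model)
    queue_wait: float     # predicted batch-queue wait in seconds
    transfer_time: float  # predicted staging time in seconds (latency and bandwidth)
    reliability: float    # estimated probability that the step succeeds here

    @property
    def completion_time(self) -> float:
        # Assumption: the three delays simply add up.
        return self.queue_wait + self.transfer_time + self.run_time


def success_probability(replicas):
    """Probability that at least one replica succeeds (independent failures assumed)."""
    return 1.0 - prod(1.0 - r.reliability for r in replicas)


def plan_over_provisioning(resources, deadline, target_success):
    """Pick the fewest replicas that meet the deadline and the success target.

    Returns the chosen resources, or None if the target cannot be reached.
    """
    # Only resources expected to finish before the deadline are candidates.
    feasible = [r for r in resources if r.completion_time <= deadline]
    # Adding the most reliable candidates first minimizes the replica count.
    feasible.sort(key=lambda r: r.reliability, reverse=True)

    chosen = []
    for resource in feasible:
        chosen.append(resource)
        if success_probability(chosen) >= target_success:
            return chosen
    return None


if __name__ == "__main__":
    candidates = [
        Resource("site-a", run_time=100, queue_wait=50, transfer_time=10, reliability=0.7),
        Resource("site-b", run_time=200, queue_wait=100, transfer_time=20, reliability=0.7),
        Resource("site-c", run_time=50, queue_wait=1000, transfer_time=5, reliability=0.9),
    ]
    plan = plan_over_provisioning(candidates, deadline=400, target_success=0.9)
    print([r.name for r in plan] if plan else "no feasible plan")
```

Under the independence assumption, two replicas with a success probability of 0.7 each give a combined success probability of 1 − 0.3² = 0.91. This arithmetic is illustrative only and is not offered as the explanation for the failure rates reported in the abstract. Migration, the second strategy, could reuse the same estimates by re-running the selection with the failed resource excluded and the deadline reduced by the time already spent.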