Proceedings of the 2008 ACM/IEEE conference on Supercomputing
In this paper, we describe the design and implementation of two mechanisms for fault tolerance and recovery in complex scientific workflows on computational grids. We present algorithms for over-provisioning and migration, our primary fault-tolerance strategies. To determine the appropriate strategy, we consider application performance models, resource reliability models, network latency and bandwidth, and queue wait times for batch queues on compute resources. Our goal is to balance reliability and performance in the presence of soft real-time constraints such as deadlines and expected success probabilities, and to do so in a way that is transparent to scientists. We evaluated our strategies by developing a Fault-Tolerance and Recovery (FTR) service and deploying it as part of the Linked Environments for Atmospheric Discovery (LEAD) production infrastructure. Results from real usage scenarios in LEAD show that our fault-tolerance strategies reduce the failure rate of individual workflow steps from about 30% to 5%.
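To make the over-provisioning idea concrete, the following is a minimal sketch, not the paper's algorithm. It assumes that replicas of a workflow step fail independently and that each succeeds with the same probability, which the abstract does not state. The function name `replicas_needed` and its parameters are hypothetical, and the value 0.7 in the usage line is purely illustrative.

```python
def replicas_needed(p_success: float, target: float, max_replicas: int = 64) -> int:
    """Smallest number of replicas n such that 1 - (1 - p_success)**n >= target.

    Assumption (not from the paper): replicas fail independently and share
    the same per-replica success probability.
    """
    if not 0.0 < p_success <= 1.0:
        raise ValueError("p_success must be in (0, 1]")
    if not 0.0 <= target < 1.0:
        raise ValueError("target must be in [0, 1)")

    allowed_failure = 1.0 - target  # probability budget for every replica failing
    p_all_fail = 1.0                # probability that all replicas so far fail
    for n in range(1, max_replicas + 1):
        p_all_fail *= 1.0 - p_success
        if p_all_fail <= allowed_failure:
            return n
    raise ValueError("target not reachable within max_replicas")


# Illustrative values only: a step that succeeds 70% of the time,
# with a desired success probability of 95%.
print(replicas_needed(0.7, 0.95))  # prints 3
```

A real strategy would also weigh the cost of extra replicas against deadlines, queue wait times, and data transfer, as the abstract indicates; this sketch covers only the success-probability constraint.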