A fault-tolerance architecture for Kepler-based distributed scientific workflows

Authors:
Pierre Mouallem;Daniel Crawl;Ilkay Altintas;Mladen Vouk;Ustun Yildiz
Affiliations:
North Carolina State University, Raleigh, NC;San Diego Supercomputer Center, University of California San Diego, La Jolla, CA;San Diego Supercomputer Center, University of California San Diego, La Jolla, CA;North Carolina State University, Raleigh, NC;University of California Davis, Davis, CA
Venue:
SSDBM'10 Proceedings of the 22nd international conference on Scientific and statistical database management
Year:
2010

Abstract

Fault tolerance and failure recovery in scientific workflows are still relatively young topics. Work in this domain so far mostly applies classic fault-tolerance mechanisms, such as "alternative versions" and "checkpointing", to scientific workflows. Scientific workflow systems often simply rely on the fault-tolerance capabilities of their third-party subcomponents, such as schedulers, Grid resources, or the underlying operating systems. When failures occur at these underlying layers, a workflow system typically sees them only as failed steps in the process, without additional detail, and its ability to recover from them may be limited. In this paper, we present an architecture that aims to address this problem for Kepler-based scientific workflows by providing more information about the failures and faults we have observed, together with a supporting implementation that offers more comprehensive failure coverage and recovery options. We discuss our framework in the context of failures observed in two production-level Kepler-based workflows, XGC and S3D. The framework has three major components: (i) a general contingency Kepler actor that provides recovery-block functionality at the workflow level, (ii) an external monitoring module that tracks the underlying workflow components and monitors the overall health of the workflow execution, and (iii) a checkpointing mechanism that provides smart resume capabilities for cases in which an unrecoverable error occurs. The framework uses the provenance data collected by Kepler-based workflows to detect failures and to support fault-tolerance decision making.
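
As a rough illustration of components (i) and (iii), the following Python sketch shows the general recovery-block pattern and a checkpoint-based resume. It is not the Kepler contingency actor or the paper's implementation. The names `run_with_alternates`, `run_workflow`, and `RecoveryBlockError`, and the plain dictionary standing in for a checkpoint store, are assumptions made for illustration only.

```python
from typing import Any, Callable, Dict, List, Sequence, Tuple


class RecoveryBlockError(Exception):
    """Raised when the primary variant and every alternate have failed."""


def run_with_alternates(
    variants: Sequence[Callable[[Any], Any]],
    accept: Callable[[Any], bool],
    value: Any,
) -> Any:
    """Recovery block: try each variant in order until one passes the acceptance test."""
    errors: List[str] = []
    for variant in variants:
        try:
            result = variant(value)
        except Exception as exc:  # a failed variant triggers the next alternate
            errors.append(f"{variant.__name__}: {exc}")
            continue
        if accept(result):
            return result
        errors.append(f"{variant.__name__}: rejected by acceptance test")
    raise RecoveryBlockError("; ".join(errors))


def run_workflow(
    steps: Sequence[Tuple[str, Callable[[Any], Any]]],
    value: Any,
    checkpoint: Dict[str, Any],
) -> Any:
    """Run named steps in order, reusing any output already recorded in the checkpoint."""
    for name, step in steps:
        if name in checkpoint:
            # Hypothetical "smart resume": skip work completed before the failure.
            value = checkpoint[name]
            continue
        value = step(value)
        checkpoint[name] = value  # record the output so a later run can resume here
    return value
```

In this sketch a step that raises an exception leaves the checkpoint holding the outputs of all earlier steps, so calling `run_workflow` again with the same dictionary resumes from the failed step rather than from the start. A real system would persist that state, which the abstract suggests could draw on collected provenance data.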