Design and evaluation of a self-healing Kepler for scientific workflows

Authors:
Arjun Hary;Ali Akoglu;Youssif AlNashif;Salim Hariri;Darrel Jenerette
Affiliations:
The University of Arizona, Tucson, AZ;The University of Arizona, Tucson, AZ;The University of Arizona, Tucson, AZ;The University of Arizona, Tucson, AZ;University of California, Riverside
Venue:
Proceedings of the 19th ACM International Symposium on High Performance Distributed Computing
Year:
2010

Citing 2
Cited 2

GridWorkflow: A Flexible Failure Handling Framework for the Grid

HPDC '03 Proceedings of the 12th IEEE International Symposium on High Performance Distributed Computing
Kepler: An Extensible System for Design and Execution of Scientific Workflows

SSDBM '04 Proceedings of the 16th International Conference on Scientific and Statistical Database Management

Supporting undo and redo in scientific data analysis

TaPP'13 Proceedings of the 5th USENIX conference on Theory and Practice of Provenance
Supporting undo and redo in scientific data analysis

Proceedings of the 5th USENIX Workshop on the Theory and Practice of Provenance

Quantified Score

Hi-index	0.00

Visualization

Abstract

Kepler is a popular open source scientific workflow (SWF) as it simplifies the effort required to construct complex data flow models through a visual interface. As the complexity of the workflow applications that will run on heterogeneous distributed systems increases, fault management becomes a critical design issue for large scale scientific and engineering applications. Due to the long execution times of these applications, it is important that they are fault tolerant; i.e. the workflow application can recover gracefully from faults without the need to restart the application from the beginning. The current implementation of Kepler tool does not support fault tolerance or recovery mechanisms. In this paper, we extend the Kepler capabilities to support fault tolerant scientific workflow (FT-SWF) with a checkpoint mechanism where corrective measures are taken seamlessly in an autonomic manner whenever a fault is detected. To the best of our knowledge, this is the first approach on adding autonomic operations to Kepler. We have evaluated the FT-Kepler on a distributed application used by ecosystem researchers. We evaluated the performance of the workflow with hardware and software based fault scenarios in terms of execution time, recovery time, and the checkpoint mechanism overhead. The experimental evaluations indicate that the checkpoint mechanism adds negligible overhead to the total execution time of the workflow and as the fault rate increases, the number of checkpoints should be increased.