Design and Implementation of a Fault Tolerant Job Flow Manager Using Job Flow Patterns and Recovery Policies

Authors:
Selim Kalayci;Onyeka Ezenwoye;Balaji Viswanathan;Gargi Dasgupta;S. Masoud Sadjadi;Liana Fong
Affiliations:
Florida International University, Miami, USA;South Dakota State University, Brookings, USA;IBM India Research Lab, New Delhi, India;IBM India Research Lab, New Delhi, India;Florida International University, Miami, USA;IBM Watson Research Center, Hawthorne, NY, USA
Venue:
ICSOC '08 Proceedings of the 6th International Conference on Service-Oriented Computing
Year:
2008

Abstract

Many grid applications are now developed as job flows composed of multiple jobs. Executing a job flow requires both a job flow manager and a job scheduler. Because job flows are long-running, support for fault tolerance and recovery policies is especially important. This support is inherently complicated by the sequencing and dependencies of jobs within a flow and by the coordination required between workflow engines and job schedulers. In this paper, we describe the design and implementation of a job flow manager that supports fault tolerance. First, at deployment time, we identify and label the job flow patterns within a job flow. Next, at runtime, we introduce a proxy that intercepts and resolves faults using these job flow patterns and their corresponding fault-recovery policies. Our design separates the job flow logic from the fault-handling logic, requires no manipulation at modeling time, and provides flexibility in fault resolution at runtime. We validate our design with a prototype implementation based on the ActiveBPEL workflow engine and the GridWay meta-scheduler, using the Montage application as a case study.
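
The two-phase idea in the abstract (labeling jobs with patterns at deployment time, then having a proxy resolve faults through per-pattern policies at runtime) can be illustrated with a minimal Python sketch. This is not the paper's implementation, which builds on ActiveBPEL and GridWay. Every name here (JobFault, RecoveryPolicy, FaultTolerantProxy, the pattern labels, the retry and skip semantics, and the job names) is a hypothetical stand-in chosen for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Dict


class JobFault(Exception):
    """Hypothetical fault raised when a submitted job fails."""


@dataclass
class RecoveryPolicy:
    """Assumed policy shape: retry a few times, then skip or fail."""
    max_retries: int       # resubmissions allowed after the first attempt
    skip_on_failure: bool  # may the flow continue without this job?


class FaultTolerantProxy:
    """Sits between the flow engine and the scheduler (illustrative only)."""

    def __init__(self,
                 submit: Callable[[str], str],
                 labels: Dict[str, str],
                 policies: Dict[str, RecoveryPolicy]) -> None:
        self.submit = submit      # stand-in for a scheduler call
        self.labels = labels      # job id -> pattern label (deployment time)
        self.policies = policies  # pattern label -> recovery policy

    def run(self, job_id: str) -> str:
        # Look up the policy through the job's pattern label, so the
        # flow definition itself carries no fault-handling logic.
        policy = self.policies[self.labels[job_id]]
        attempts = 0
        while True:
            attempts += 1
            try:
                return self.submit(job_id)
            except JobFault:
                if attempts > policy.max_retries:
                    if policy.skip_on_failure:
                        return "skipped"
                    raise


def make_flaky_submit(failures: int) -> Callable[[str], str]:
    """Build a fake scheduler that fails `failures` times, then succeeds."""
    state = {"calls": 0}

    def submit(job_id: str) -> str:
        state["calls"] += 1
        if state["calls"] <= failures:
            raise JobFault(job_id)
        return f"done:{job_id}"

    return submit


# Hypothetical deployment-time labels and runtime policies.
LABELS = {"project": "parallel", "add": "sequence"}
POLICIES = {
    "sequence": RecoveryPolicy(max_retries=2, skip_on_failure=False),
    "parallel": RecoveryPolicy(max_retries=1, skip_on_failure=True),
}

retry_result = FaultTolerantProxy(make_flaky_submit(2), LABELS, POLICIES).run("add")
skip_result = FaultTolerantProxy(make_flaky_submit(10), LABELS, POLICIES).run("project")
```

In this sketch the policy table can be changed at runtime without touching the flow definition, which mirrors the separation of job flow logic and fault-handling logic that the abstract claims as an advantage.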