Error recovery mechanism for grid-based workflow within SLA context

Authors:
Dang Minh Quan
Affiliations:
Paderborn Center for Parallel Computing (PC2), University of Paderborn, Fuerstenallee 11, Paderborn, 33102, Germany
Venue:
International Journal of High Performance Computing and Networking
Year:
2007

Citing 13
Cited 5

The network weather service: a distributed resource performance forecasting service for metacomputing

Future Generation Computer Systems - Special issue on metacomputing
Experiences with predicting resource performance on-line in computational grid settings

ACM SIGMETRICS Performance Evaluation Review
Specifying and Monitoring Guarantees in Commercial Grids through SLA

CCGRID '03 Proceedings of the 3rd International Symposium on Cluster Computing and the Grid
Heuristics for Scheduling Parameter Sweep Applications in Grid Environments

HCW '00 Proceedings of the 9th Heterogeneous Computing Workshop
GridWorkflow: A Flexible Failure Handling Framework for the Grid

HPDC '03 Proceedings of the 12th IEEE International Symposium on High Performance Distributed Computing
On Architecture for SLA-Aware Workflows in Grid Environments

AINA '05 Proceedings of the 19th International Conference on Advanced Information Networking and Applications - Volume 1
The virtual resource manager: an architecture for SLA-aware resource management

CCGRID '04 Proceedings of the 2004 IEEE International Symposium on Cluster Computing and the Grid
Critical-Path and Priority based Algorithms for Scheduling Workflows with Parameter Sweep Tasks on Global Grids

SBAC-PAD '05 Proceedings of the 17th International Symposium on Computer Architecture on High Performance Computing
New grid scheduling and rescheduling methods in the GrADS project

International Journal of Parallel Programming - Special issue: The next generation software program
Task scheduling strategies for workflow-based applications in grids

CCGRID '05 Proceedings of the Fifth IEEE International Symposium on Cluster Computing and the Grid (CCGrid'05) - Volume 2
SLA negotiation protocol for grid-based workflows

HPCC'05 Proceedings of the First international conference on High Performance Computing and Communications
Transparent fault tolerance for grid applications

EGC'05 Proceedings of the 2005 European conference on Advances in Grid Computing
Mapping workflows onto grid resources within an SLA context

EGC'05 Proceedings of the 2005 European conference on Advances in Grid Computing

A recovery mechanism for errors caused by a late subjob in a system handling SLA-based Grid workflows

International Journal of Web and Grid Services
Mapping Heavy Communication Grid-Based Workflows Onto Grid Resources Within an SLA Context Using Metaheuristics

International Journal of High Performance Computing Applications
Resource allocation algorithm for light communication grid-based workflows within an SLA context

International Journal of Parallel, Emergent and Distributed Systems
Business model and the policy of mapping light communication grid-based workflow within the SLA Context

HPCC'07 Proceedings of the Third international conference on High Performance Computing and Communications
Performance evaluation of cloud service considering fault recovery

The Journal of Supercomputing


Abstract

Service Level Agreements (SLAs) serve as a foundation for reliable and predictable job execution at remote grid sites. In this paper, we describe an error recovery mechanism for workflows within the SLA context that copes with catastrophic failure, in which one or several High Performance Computing Centers (HPCCs) are detached from the grid system. We propose an algorithm to detect all affected sub-jobs when the error happens and an algorithm to remap those sub-jobs to the remaining healthy HPCCs while optimising the makespan. The experimental results show that our mechanism discovers a higher-quality solution in a shorter time than other existing methods.
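
The abstract does not spell out the detection algorithm, so the following is only an illustrative sketch of what "detecting all affected sub-jobs" could look like, not the paper's method. It assumes the workflow is a dependency graph, that a sub-job is directly affected when it is unfinished and placed on a detached HPCC, and that every unfinished descendant of such a sub-job is also treated as affected because its reserved start time may no longer hold. The function name `affected_subjobs` and all parameter names are hypothetical.

```python
from collections import deque


def affected_subjobs(deps, placement, finished, failed_sites):
    """Return the set of sub-jobs to remap after some sites fail.

    deps: dict mapping every sub-job to the list of sub-jobs it depends on
          (every sub-job must appear as a key).
    placement: dict mapping every sub-job to the site (HPCC) it was mapped to.
    finished: set of sub-jobs that have already completed.
    failed_sites: set of sites detached from the grid.

    Assumption (not from the paper): results of finished sub-jobs remain
    available, so finished sub-jobs are never treated as affected.
    """
    # Derive successor lists from the dependency lists.
    successors = {job: [] for job in deps}
    for job, parents in deps.items():
        for parent in parents:
            successors[parent].append(job)

    # Directly affected: unfinished sub-jobs placed on a failed site.
    seeds = [j for j in deps
             if placement[j] in failed_sites and j not in finished]
    affected = set(seeds)
    queue = deque(seeds)

    # Propagate along dependencies with a breadth-first traversal.
    while queue:
        job = queue.popleft()
        for child in successors[job]:
            if child not in affected and child not in finished:
                affected.add(child)
                queue.append(child)
    return affected


# Hypothetical four-sub-job workflow: a -> b, a -> c, (b, c) -> d.
deps = {"a": [], "b": ["a"], "c": ["a"], "d": ["b", "c"]}
placement = {"a": "site1", "b": "site2", "c": "site3", "d": "site1"}
result = affected_subjobs(deps, placement, finished={"a"}, failed_sites={"site2"})
```

In this example, sub-job `b` is affected because it sits on the detached site, and `d` is affected because it depends on `b`. Treating every descendant as affected is a conservative choice; a real system could check whether the descendant's reserved time slot can still be met before remapping it.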