A recovery mechanism for errors caused by a late subjob in a system handling SLA-based Grid workflows

  • Authors:
  • Dang Minh Quan; Jorn Altmann

  • Affiliations:
  • School of Information Technology, International University in Germany, Germany; School of Information Technology, International University in Germany, Germany / Technology Management, Economics and Policy Program (TEMEP), School of Engineering, Seoul National Universit ...

  • Venue:
  • International Journal of Web and Grid Services
  • Year:
  • 2008

Abstract

Supporting SLAs (Service Level Agreements) for Grid-based workflows requires mechanisms for handling errors (i.e., failures of subjobs). In this paper, we propose an error recovery mechanism that can handle one failed subjob of a workflow. The mechanism has at most three phases, depending on the impact of the error. Each phase uses a dedicated algorithm to remap the subjobs of the workflow to the resources. The main contributions of the paper are the error recovery mechanism for SLA-based workflows and the mapping algorithm G-map, which is used in the first phase of the recovery mechanism. G-map remaps the groups of subjobs that are directly affected by an error. The efficiency of the proposed algorithm is validated through simulation results.