Reliable DAG scheduling on grids with rewinding and migration

Authors:
Israel Hernandez; Murray Cole
Affiliations:
University of Edinburgh; University of Edinburgh
Venue:
Proceedings of the first international conference on Networks for grid applications
Year:
2007

Abstract

Fault tolerance is an important issue in Grid computing because the availability of Grid resources cannot be guaranteed. Effective scheduling methods must include fault-tolerance mechanisms that preserve the execution of DAG applications when a processor fails. To address this, we designed the DAG rewinding mechanism, an event-driven process that is executed when a failure is detected at a rescheduling point. The rewinding mechanism preserves the execution of the application by recomputing and migrating those tasks that would otherwise disrupt the forward execution of succeeding tasks. In effect, the mechanism rewinds the progress of the application to a previous state from which execution can continue despite the failed processor(s). This paper extends our earlier work by adding the rewinding mechanism to our previous dynamic scheduling methods, GTP and GTP/c, and shows how to integrate it within our dynamic execution models.
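The idea of rewinding only as far as needed can be illustrated with a small sketch. It is not the GTP or GTP/c algorithm from the paper. It assumes, for illustration only, that a finished task's output is stored solely on the processor that ran it, so a failure loses that output. The function name `plan_rewind` and its parameters are hypothetical. Given the DAG's predecessor lists, the current task placement, the set of finished tasks, and the set of failed processors, it returns the finished tasks that would have to be recomputed and the unfinished tasks that would have to be migrated.

```python
from collections import deque


def plan_rewind(preds, placement, finished, failed):
    """Sketch of a rewind decision after processor failures.

    preds:     dict mapping each task to a list of its predecessor tasks
    placement: dict mapping each task to the processor it is assigned to
    finished:  set of tasks that had completed before the failure
    failed:    set of processors detected as failed

    Assumption (not from the paper): a finished task's output exists
    only on the processor that executed it.
    """
    # Finished tasks whose outputs disappeared with a failed processor.
    lost = {t for t in finished if placement[t] in failed}

    # Unfinished tasks assigned to a failed processor must move elsewhere.
    migrate = {t for t in placement
               if t not in finished and placement[t] in failed}

    # Walk backwards from every unfinished task: a lost predecessor must be
    # recomputed, and it in turn needs the outputs of its own predecessors.
    rewind = set()
    worklist = deque(t for t in placement if t not in finished)
    while worklist:
        task = worklist.popleft()
        for p in preds.get(task, ()):
            if p in lost and p not in rewind:
                rewind.add(p)
                worklist.append(p)
    return rewind, migrate


# Diamond DAG: a -> b -> d and a -> c -> d; only d is still pending.
preds = {"a": [], "b": ["a"], "c": ["a"], "d": ["b", "c"]}
finished = {"a", "b", "c"}

# Case 1: a and b ran on P1, which fails. d needs b, and b needs a.
placement1 = {"a": "P1", "b": "P1", "c": "P2", "d": "P2"}
rewind1, migrate1 = plan_rewind(preds, placement1, finished, {"P1"})

# Case 2: only a ran on P1. Its output is lost but no pending task needs it.
placement2 = {"a": "P1", "b": "P2", "c": "P2", "d": "P2"}
rewind2, migrate2 = plan_rewind(preds, placement2, finished, {"P1"})
```

In the first case the sketch selects `a` and `b` for recomputation. In the second it selects nothing, because the lost output of `a` is no longer required by any pending task. This mirrors the abstract's point that only tasks disrupting forward execution are recomputed.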