Combined Fault Tolerance and Scheduling Techniques for Workflow Applications on Computational Grids

Authors:
Yang Zhang;Anirban Mandal;Charles Koelbel;Keith Cooper
Affiliations:
-;-;-;-
Venue:
CCGRID '09 Proceedings of the 2009 9th IEEE/ACM International Symposium on Cluster Computing and the Grid
Year:
2009

Abstract

Complex scientific workflows are increasingly executed on computational grids. In addition to the challenges of managing and scheduling these workflows, reliability challenges arise because of the unreliable nature of large-scale grid infrastructure. Fault tolerance mechanisms such as over-provisioning and checkpoint-recovery are used in current grid application management systems to address these reliability challenges. In this work, we propose new approaches that combine these fault tolerance techniques with existing workflow scheduling algorithms. We present a study of the effectiveness of the combined approaches, analyzing their impact on the reliability of workflow execution, workflow performance, and resource usage under different reliability models, failure prediction accuracies, and workflow application types.