Fault-tolerant dynamic job scheduling policy

Authors:
J. H. Abawajy
Affiliations:
School of Information Technology, Deakin University, Geelong, VIC., Australia
Venue:
ICA3PP'05 Proceedings of the 6th international conference on Algorithms and Architectures for Parallel Processing
Year:
2005

Abstract

In this paper, we propose a scalable and fault-tolerant job scheduling framework for grid computing. The proposed framework loosely couples a dynamic job scheduling approach with a hybrid replication approach to schedule jobs efficiently while also providing fault tolerance. The novelty of the framework is that it uses passive replication under high system load and active replication under low system load. The switch between the two replication methods is made dynamically and transparently.
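
The load-dependent switch described in the abstract can be illustrated with a minimal sketch. The abstract does not say how load is measured or where the switch point lies, so everything named here is a hypothetical chosen for illustration: the `SystemState` type, the busy-node fraction used as the load metric, the `choose_mode` function, and the 0.7 threshold.

```python
from dataclasses import dataclass
from enum import Enum


class ReplicationMode(Enum):
    ACTIVE = "active"    # replicas of a job run concurrently
    PASSIVE = "passive"  # one primary runs; backups start only on failure


@dataclass
class SystemState:
    """Hypothetical snapshot of the grid; not taken from the paper."""
    busy_nodes: int
    total_nodes: int

    @property
    def load(self) -> float:
        # Assumed load metric: fraction of nodes currently busy.
        # An empty system is treated as fully loaded (nothing to replicate on).
        if self.total_nodes == 0:
            return 1.0
        return self.busy_nodes / self.total_nodes


def choose_mode(state: SystemState, high_load_threshold: float = 0.7) -> ReplicationMode:
    """Pick a replication mode from the current load.

    The threshold value is an assumption; the abstract only states that
    passive replication is used under high load and active under low load.
    """
    if state.load >= high_load_threshold:
        return ReplicationMode.PASSIVE
    return ReplicationMode.ACTIVE


if __name__ == "__main__":
    for busy in (2, 9):
        state = SystemState(busy_nodes=busy, total_nodes=10)
        print(f"load={state.load:.1f} -> {choose_mode(state).value}")
```

Calling `choose_mode` each time a job is dispatched would make the switch dynamic in the sense the abstract describes, since the caller never selects a mode explicitly. Whether the framework decides per job or on a periodic basis is not stated in the abstract.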