Pro-active failure handling mechanisms for scheduling in grid computing environments

Authors:
B. T. Benjamin Khoo;Bharadwaj Veeravalli
Affiliations:
National University of Singapore, Department of Electrical and Computer Engineering, Singapore;National University of Singapore, Department of Electrical and Computer Engineering, Singapore
Venue:
Journal of Parallel and Distributed Computing
Year:
2010


Abstract

In this paper, we design pro-active failure handling strategies for grid environments. These strategies estimate the availability of resources in the Grid and preemptively calculate the Grid's expected long-term capacity. We modify the backfill and replication algorithms to incorporate all three pro-active strategies and assess how effectively each prevents job failures during execution. We also extend our earlier work on a co-ordinate based allocation strategy; the extended algorithm likewise shows continual improvement when operating under the same execution environment. In our experiments, we compare the enhanced algorithms with their original forms and show that pro-active failure handling can, in some cases, avoid all job failures during execution. We also show that, among the algorithms considered, NSA provides the best balance between enhanced throughput and job failures during execution.
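
As a rough illustration of what estimating resource availability before scheduling might look like, the sketch below scores each resource by the probability that it survives a job's runtime and keeps only those above a threshold. This is not the paper's method: the exponential failure model, the function names `survival_probability` and `pick_resources`, the MTBF inputs, and the 0.9 threshold are all assumptions made for the example.

```python
import math


def survival_probability(mtbf_hours: float, runtime_hours: float) -> float:
    """Probability of no failure during the job, assuming exponentially
    distributed time between failures (an assumption of this sketch)."""
    return math.exp(-runtime_hours / mtbf_hours)


def pick_resources(resources: dict, runtime_hours: float, threshold: float = 0.9) -> list:
    """Return the names of resources likely to outlive the job,
    ordered from most to least reliable.

    `resources` maps a hypothetical resource name to its observed
    mean time between failures (MTBF) in hours.
    """
    scored = [
        (survival_probability(mtbf, runtime_hours), name)
        for name, mtbf in resources.items()
    ]
    # Sort by survival probability, highest first, then apply the threshold.
    scored.sort(reverse=True)
    return [name for prob, name in scored if prob >= threshold]


if __name__ == "__main__":
    # Hypothetical MTBF figures, for illustration only.
    grid = {"node-a": 1000.0, "node-b": 10.0, "node-c": 100.0}
    print(pick_resources(grid, runtime_hours=5.0))
```

A scheduler that filters candidates this way trades some throughput for fewer mid-execution failures, which is the kind of balance the abstract describes; the threshold controls where that trade-off sits.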