In this paper, we propose a scalable and fault-tolerant job scheduling framework for grid computing. The framework loosely couples a dynamic job scheduling approach with a hybrid replication approach, so that jobs are scheduled efficiently while fault tolerance is still provided. Its novelty is that it uses passive replication under high system load and active replication under low system load. The switch between the two replication methods is made dynamically and transparently.
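The load-dependent switch between the two replication methods can be illustrated with a minimal sketch. This is not the paper's implementation: the names `choose_mode` and `plan_job`, the normalized load value in [0, 1], the single threshold of 0.7, and the number of copies are all assumptions made here for illustration, since the abstract does not say how load is measured or where the switch occurs.

```python
from enum import Enum


class ReplicationMode(Enum):
    ACTIVE = "active"    # all copies of the job run concurrently
    PASSIVE = "passive"  # one primary runs; backups wait on standby


def choose_mode(load, high_load_threshold=0.7):
    """Pick a replication mode from a normalized system load in [0, 1].

    The threshold value is hypothetical; the abstract only states that
    passive replication is used under high load and active under low load.
    """
    if not 0.0 <= load <= 1.0:
        raise ValueError("load must be in [0, 1]")
    if load >= high_load_threshold:
        return ReplicationMode.PASSIVE
    return ReplicationMode.ACTIVE


def plan_job(job_id, nodes, load, copies=2, high_load_threshold=0.7):
    """Return a placement plan for one job (illustrative only)."""
    if len(nodes) < copies:
        raise ValueError("not enough nodes for the requested copies")
    mode = choose_mode(load, high_load_threshold)
    chosen = list(nodes[:copies])
    if mode is ReplicationMode.ACTIVE:
        # Low load: spare capacity is spent on running every copy at once.
        running, standby = chosen, []
    else:
        # High load: only the primary consumes compute; backups stay idle
        # until a failure is detected.
        running, standby = chosen[:1], chosen[1:]
    return {"job": job_id, "mode": mode, "running": running, "standby": standby}
```

Because the mode is recomputed for each call to `plan_job`, the switch happens per scheduling decision without the job submitter having to choose a mode, which is one simple way to read "dynamically and transparently".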