Compared with supercomputers and PCs, today's large distributed systems such as clusters and grids have a higher rate of unsuccessful job execution, which is a significant cause of wasted resources. Although many approaches have been proposed to make these environments more fault tolerant, their high overhead has led researchers to look for preventive methods. In this work, we employ a job futurity predictor to manage arriving jobs efficiently. To this end, we propose a novel meta-scheduler sub-component called the Job Submission Manager (JSM). The main role of the JSM is to filter incoming jobs according to parameters such as the current system load and the job failure probability. Experimental results based on two different modelling approaches indicate that this managing component can effectively influence system throughput and increase the utilisation of computing resources.
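The filtering role described above can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify the filtering rule, so the names `Job`, `admit`, and `filter_jobs`, the thresholds, and the policy of tightening the failure-probability limit under high load are all assumptions made for illustration. The failure probability is assumed to come from an external predictor.

```python
from dataclasses import dataclass


@dataclass
class Job:
    job_id: str
    # Assumed to be produced by an external predictor, in the range [0, 1].
    failure_probability: float


def admit(job, system_load, max_failure_probability=0.5, high_load=0.8):
    """Return True if the job should be forwarded to the scheduler.

    Hypothetical policy: when the system load (a fraction in [0, 1]) reaches
    `high_load`, the accepted failure probability is halved, so that scarce
    resources go to jobs that are more likely to succeed.
    """
    if system_load >= high_load:
        threshold = max_failure_probability / 2
    else:
        threshold = max_failure_probability
    return job.failure_probability <= threshold


def filter_jobs(jobs, system_load, **policy):
    """Keep only the jobs admitted under the current system load."""
    return [job for job in jobs if admit(job, system_load, **policy)]


if __name__ == "__main__":
    arriving = [Job("a", 0.1), Job("b", 0.4), Job("c", 0.9)]
    print([j.job_id for j in filter_jobs(arriving, system_load=0.3)])
    print([j.job_id for j in filter_jobs(arriving, system_load=0.9)])
```

In this sketch a rejected job is simply dropped from the returned list; a real submission manager might instead delay or resubmit it, which the abstract leaves open.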