A job submission manager for large-scale distributed systems based on job futurity predictor

Authors:
Hamid Saadatfar; Hossein Deldari
Affiliations:
Parallel and Distributed Processing Lab, Computer Engineering Department, Faculty of Engineering, Ferdowsi University of Mashhad, Azadi Sq., Mashhad, Khorasan Razavi, P.O. Box 91775-1111, Iran;Parallel and Distributed Processing Lab, Computer Engineering Department, Faculty of Engineering, Ferdowsi University of Mashhad, Azadi Sq., Mashhad, Khorasan Razavi, P.O. Box 91775-1111, Iran
Venue:
International Journal of Grid and Utility Computing
Year:
2014


Abstract

Compared with supercomputers and PCs, today's large distributed systems such as clusters and grids suffer a higher rate of unsuccessful job execution, which is a significant cause of wasted resources. Although many approaches have been proposed to make these environments more fault tolerant, their high overhead has led researchers to look for preventive methods. In this work, we employ a job futurity predictor to manage arriving jobs efficiently. To this end, a novel meta-scheduler sub-component called the Job Submission Manager (JSM) is proposed. The main role of the JSM is to filter incoming jobs according to parameters such as the current system load and the job failure probability. Experimental results based on two different modelling approaches indicate that this component can effectively improve system throughput and increase the utilisation of computing resources.
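
The filtering role described in the abstract can be illustrated with a minimal sketch. This is not the authors' JSM: the abstract does not give the actual decision rule, so the linear threshold, the names `Job`, `admit`, and `filter_jobs`, and the parameters `base_threshold` and `load_penalty` are all hypothetical. The sketch only assumes that a predictor supplies a failure probability per job and that the system load is known as a fraction between 0 and 1.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Job:
    job_id: str
    # Assumed to come from some job futurity predictor; value in [0, 1].
    failure_probability: float


def admit(job: Job, system_load: float,
          base_threshold: float = 0.9, load_penalty: float = 0.5) -> bool:
    """Hypothetical admission rule: tolerate less failure risk as load rises.

    The linear form and both default parameters are illustrative
    assumptions, not values taken from the paper.
    """
    if not 0.0 <= system_load <= 1.0:
        raise ValueError("system_load must be in [0, 1]")
    threshold = base_threshold - load_penalty * system_load
    return job.failure_probability <= threshold


def filter_jobs(jobs: List[Job], system_load: float) -> List[Job]:
    """Return the jobs that pass the admission rule at the given load."""
    return [job for job in jobs if admit(job, system_load)]


if __name__ == "__main__":
    queue = [Job("a", 0.2), Job("b", 0.6), Job("c", 0.95)]
    for load in (0.0, 1.0):
        admitted = [job.job_id for job in filter_jobs(queue, load)]
        print(f"load={load}: admitted {admitted}")
```

Under these assumed parameters, an idle system admits all but the riskiest job, while a fully loaded system admits only the job with the lowest predicted failure probability. Any real policy would need thresholds calibrated against workload traces.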