Optimizing jobs timeouts on clusters and production grids

Authors:
Tristan Glatard; Xavier
Affiliations:
CNRS, I3S unit, France; INRIA Sophia-Antipolis, Asclepios, France
Venue:
CCGRID '07 Proceedings of the Seventh IEEE International Symposium on Cluster Computing and the Grid
Year:
2007


Abstract

This paper presents a method to optimize the timeout value of computing jobs. It relies on a model of the job execution time that represents the latency of the job management system as a random variable. It also takes into account a proportion of outliers, so that it can model either reliable clusters or production grids characterized by faults that cause job loss. Job management systems are first studied using classical distributions. Different behaviors are exhibited, depending on the weight of the tail of the distribution and on the proportion of outliers. Experimental results are then shown, based on the latency distribution and outlier ratios measured on the EGEE grid infrastructure. Those results show that using the optimal timeout value provided by our method reduces the impact of outliers and leads to a speed-up of 1.36 even for reliable systems without outliers.
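
The abstract does not state the model's equations, so the following is only a minimal sketch of the kind of computation it describes, under assumptions that are ours and not necessarily the authors': a job that has not started by the timeout is cancelled and resubmitted, successive latencies are independent, and an outlier never starts. With latency CDF F and outlier ratio rho, each attempt succeeds with probability p = (1 - rho) F(t) and costs E[min(R, t)] on average, which gives an expected total time of E[min(R, t)] / p. The function names and the two example distributions are hypothetical.

```python
import math


def expected_time(cdf, timeout, outlier_ratio=0.0, steps=2000):
    """Expected total latency when a job is resubmitted every `timeout` seconds.

    Assumption (not taken from the abstract): attempts are independent and
    an outlier never completes, so the defective CDF is (1 - rho) * F.
    """
    def survival(u):
        # Probability that a single attempt has not started by time u.
        return 1.0 - (1.0 - outlier_ratio) * cdf(u)

    # Trapezoidal rule for E[min(R, timeout)] = integral of survival on [0, timeout].
    h = timeout / steps
    area = 0.5 * (survival(0.0) + survival(timeout))
    for i in range(1, steps):
        area += survival(i * h)
    area *= h

    p_success = (1.0 - outlier_ratio) * cdf(timeout)
    return area / p_success


def best_timeout(cdf, candidates, outlier_ratio=0.0):
    """Grid search: return the candidate timeout with the lowest expected time."""
    return min(candidates, key=lambda t: expected_time(cdf, t, outlier_ratio))


def exponential_cdf(rate):
    """Light-tailed (memoryless) latency model, used here as a sanity check."""
    return lambda u: 1.0 - math.exp(-rate * u)


def lognormal_cdf(mu, sigma):
    """Heavy-tailed latency model; parameters are illustrative, not measured."""
    def cdf(u):
        if u <= 0:
            return 0.0
        z = (math.log(u) - mu) / (sigma * math.sqrt(2.0))
        return 0.5 * (1.0 + math.erf(z))
    return cdf


if __name__ == "__main__":
    heavy = lognormal_cdf(mu=5.0, sigma=1.5)
    t_opt = best_timeout(heavy, range(50, 2001, 50), outlier_ratio=0.05)
    print("timeout:", t_opt, "expected time:", expected_time(heavy, t_opt, 0.05))
```

In this sketch a memoryless exponential latency with no outliers gains nothing from a timeout (the expected time stays at 1/rate for any timeout value), whereas a heavy-tailed latency or a non-zero outlier ratio makes the choice of timeout matter, which is in line with the dependence on tail weight and outlier proportion mentioned in the abstract.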