A model of pilot-job resource provisioning on production grids

Authors:
Tristan Glatard;Sorina Camarasu-Pop
Affiliations:
CREATIS - CNRS UMR 5220 - INSERM U1044 - Université Lyon 1 - INSA Lyon, 69621 Villeurbanne, France;CREATIS - CNRS UMR 5220 - INSERM U1044 - Université Lyon 1 - INSA Lyon, 69621 Villeurbanne, France
Venue:
Parallel Computing
Year:
2011

Abstract

Pilot-job systems emerged as a computation paradigm to cope with the heterogeneity of large-scale production grids, greatly reducing fault ratios and middleware overheads. They are now widely adopted to sustain the computation of scientific applications on such platforms. However, a model of pilot-job systems is still lacking, making it difficult to build realistic experimental setups for their study (e.g. simulators or controlled platforms). The variability of production conditions, background loads and resource characteristics further complicates this issue. This paper presents a model of pilot-job resource provisioning. Based on a probabilistic model of pilot submission and registration, the number of pilots registered to the application host and the makespan of a divisible-load application are derived. The model takes job failures into account and makes no assumption about the characteristics of the computing resources, the scheduling algorithm or the background load. Only a minimally invasive monitoring of the grid is required. The model is evaluated in production conditions, using logs acquired on a pilot-job server deployed in the biomed virtual organization of the European Grid Infrastructure. Experimental results show that the model accurately describes the number of registered pilots over time periods ranging from a few hours to a few days and under different pilot submission conditions.
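
To make the two derived quantities concrete, the sketch below computes an expected number of registered pilots over time and the resulting makespan of a divisible load. It is an illustration only and does not reproduce the equations of the paper: the function names, the exponential registration latency, the single failure probability and the unit-speed pilots are all assumptions introduced here for the example.

```python
import math


def exponential_cdf(mean_latency):
    """Return the CDF of an exponential registration latency.

    Assumption for illustration only: in practice the latency distribution
    would be estimated from grid monitoring data.
    """
    def cdf(x):
        return 1.0 - math.exp(-x / mean_latency) if x > 0 else 0.0
    return cdf


def expected_registered(t, submit_times, failure_prob, latency_cdf):
    """Expected number of pilots registered at time t.

    A pilot submitted at time s registers by time t with probability
    (1 - failure_prob) * latency_cdf(t - s); failed pilots never register.
    """
    arrived = sum(latency_cdf(t - s) for s in submit_times if t >= s)
    return (1.0 - failure_prob) * arrived


def makespan(work, submit_times, failure_prob, latency_cdf,
             step=1.0, horizon=1e6):
    """Smallest time at which the accumulated pilot-time covers `work`.

    Hypothetical simplification: every registered pilot processes one unit
    of the divisible load per time unit. The integral of the expected pilot
    count is approximated with the midpoint rule. Returns None if the work
    is not completed before `horizon`.
    """
    done = 0.0
    t = 0.0
    while t < horizon:
        done += expected_registered(t + step / 2.0, submit_times,
                                    failure_prob, latency_cdf) * step
        t += step
        if done >= work:
            return t
    return None


if __name__ == "__main__":
    # Hypothetical scenario: 100 pilots submitted 30 s apart,
    # 20% failure probability, 10 min mean registration latency.
    submissions = [30.0 * i for i in range(100)]
    cdf = exponential_cdf(600.0)
    for hours in (1, 2, 4):
        n = expected_registered(3600.0 * hours, submissions, 0.2, cdf)
        print(f"t = {hours} h: about {n:.1f} pilots registered")
    print("makespan (s):", makespan(5.0e5, submissions, 0.2, cdf, step=10.0))
```

Working with expected values keeps the sketch deterministic; a Monte Carlo variant that samples failures and latencies per pilot would be needed to study the variability around these averages.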