Pilot-job systems have emerged as a computation paradigm to cope with the heterogeneity of production grids, greatly reducing fault ratios and latency. Tools like DIANE, WISDOM-II, ToPoS and Condor glideIns are now widely adopted to conduct large-scale experiments on such platforms. However, a model of pilot-job applications is still lacking, which makes it difficult to determine submission parameters such as the number of pilots to submit to achieve a given performance level. The variability of production conditions and the heterogeneity of the underlying middleware and infrastructure further complicate this issue. This paper presents a performance model for pilot-job applications running on production grids. Based on probabilistic modelling, we derive statistics about the number of available pilots over time and the makespan of the application given the number of submitted pilots. Results obtained on a radiotherapy application running on the EGEE production grid show that the model is accurate enough to correctly describe the behavior of the application, setting the basis for further optimization strategies.
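To make the two quantities named above more concrete (the number of available pilots over time, and the makespan as a function of the number of submitted pilots), the following is a minimal Monte Carlo sketch. It is not the model presented in the paper. Every name and parameter in it is an assumption made for illustration: pilot latencies are drawn from an exponential distribution, each pilot fails independently with probability `fault_ratio`, all tasks have the same duration, and each task is given to the pilot that becomes free earliest.

```python
import heapq
import random


def simulate_pilots(n_pilots, n_tasks, task_time=60.0,
                    mean_latency=300.0, fault_ratio=0.1, seed=0):
    """Simulate one run; return (makespan, sorted pilot start times).

    Illustrative assumptions (not taken from the paper):
    - a pilot fails with probability `fault_ratio` and never starts;
    - the latency of a surviving pilot is exponential with mean `mean_latency`;
    - every task lasts `task_time` seconds.
    The makespan is None when no pilot survives.
    """
    rng = random.Random(seed)
    starts = sorted(
        rng.expovariate(1.0 / mean_latency)
        for _ in range(n_pilots)
        if rng.random() >= fault_ratio
    )
    if not starts:
        return None, starts

    # Min-heap of the times at which each running pilot becomes free.
    free_at = list(starts)
    heapq.heapify(free_at)
    makespan = 0.0
    for _ in range(n_tasks):
        begin = heapq.heappop(free_at)   # earliest available pilot
        end = begin + task_time
        makespan = max(makespan, end)
        heapq.heappush(free_at, end)
    return makespan, starts


def available_pilots(starts, t):
    """Number of pilots that have started by time t."""
    return sum(1 for s in starts if s <= t)


def mean_makespan(n_pilots, n_tasks, runs=200, **kwargs):
    """Average makespan over several seeded runs, ignoring runs with no pilot."""
    values = []
    for seed in range(runs):
        m, _ = simulate_pilots(n_pilots, n_tasks, seed=seed, **kwargs)
        if m is not None:
            values.append(m)
    return sum(values) / len(values) if values else None
```

Calling `mean_makespan(n, 100)` for several values of `n` gives a rough view of how the makespan responds to the number of submitted pilots under these assumed distributions; a real study would replace them with latency and fault statistics measured on the target grid.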