Computational grids with multiple batch systems (batch grids) have been used to execute loosely coupled and workflow-based parallel applications efficiently, but they can also be powerful infrastructures for long-running multi-component parallel applications. In this paper, we present a generic middleware framework for executing long-running multi-component applications whose execution times far exceed the execution time limits of batch queues. The framework coordinates the distribution, execution, migration, and restart of the application's components across multiple queues, where the component jobs in different queues can have different queue waiting and startup times. We have used the framework with a prominent long-running multi-component climate modeling application, the Community Climate System Model (CCSM). We performed real multi-site CCSM runs for 6.5 days of wallclock time, spanning three sites with four queues and emulated external workloads. Our experiments indicate that multi-site execution can yield good application throughput.
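The core constraint described above, a run that is much longer than any single batch job may last, can be illustrated with a toy model. The sketch below is not the paper's middleware: the `Queue` class, the `run_in_segments` function, the constant queue wait, and all numbers are hypothetical. It only shows how a long run breaks into checkpointed jobs that each fit within a queue's time limit, and how queue waits add to the elapsed time.

```python
from dataclasses import dataclass


@dataclass
class Queue:
    """Hypothetical batch queue description (not from the paper)."""
    name: str
    time_limit_h: float  # maximum wallclock hours allowed per job
    wait_h: float        # assumed constant wait before each job starts


def run_in_segments(total_work_h, queue):
    """Split a long run into restartable jobs that each fit the queue limit.

    Returns (number of jobs submitted, total elapsed hours including waits).
    Checkpoint and restart costs are ignored in this simplified model.
    """
    remaining = total_work_h
    jobs = 0
    elapsed = 0.0
    while remaining > 0:
        # Each job runs until the work is done or the queue limit is hit.
        segment = min(remaining, queue.time_limit_h)
        elapsed += queue.wait_h + segment
        remaining -= segment
        jobs += 1
    return jobs, elapsed


if __name__ == "__main__":
    # Illustrative numbers only: 100 hours of work on a 24-hour queue.
    q = Queue(name="example", time_limit_h=24, wait_h=3)
    jobs, elapsed = run_in_segments(100, q)
    print(f"{jobs} jobs, {elapsed} hours elapsed")
```

In this model, the only benefit of spreading component jobs over several queues is a shorter wait per restart. A real framework must also handle components that start at different times on different queues.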