Adaptive Executions of Multi-Physics Coupled Applications on Batch Grids

Authors:
Sivagama Sundari Murugavel;Sathish S. Vadhiyar;Ravi S. Nanjundiah
Affiliations:
Supercomputer Education and Research Centre, Indian Institute of Science, Bangalore, India;Supercomputer Education and Research Centre, Indian Institute of Science, Bangalore, India;Centre for Atmospheric & Oceanic Sciences, Indian Institute of Science, Bangalore, India
Venue:
Journal of Grid Computing
Year:
2011

Abstract

Long-running multi-physics coupled parallel applications have gained prominence in recent years. Their high computational requirements and long simulation durations necessitate the use of multiple systems of a Grid for execution. In this paper, we have built an adaptive middleware framework for executing long-running multi-physics coupled applications across multiple batch systems of a Grid. Besides coordinating the executions of an application's component jobs on different batch systems, our framework automatically resubmits the jobs to the batch queues as many times as needed to sustain long-running executions. As the set of active batch systems available for execution changes, the framework migrates and reschedules components using a robust rescheduling decision algorithm. We have used the framework to improve the application throughput of a prominent long-running multi-component climate modeling application, the Community Climate System Model (CCSM). Our real multi-site experiments with CCSM indicate that Grid executions can improve application throughput for climate models.
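
The two mechanisms named in the abstract, repeated resubmission and rescheduling when the set of active batch systems changes, can be illustrated with a minimal sketch. It is not the paper's rescheduling decision algorithm, which the abstract does not describe. The function names, the component and system labels, and the "keep a component in place if its system is still active, otherwise move it to the least-loaded active system" rule are all assumptions made for illustration.

```python
import math


def submissions_needed(total_hours, walltime_limit_hours):
    """Hypothetical helper: number of batch submissions needed to finish a run
    when each submission is capped by the queue's wall-clock limit."""
    return math.ceil(total_hours / walltime_limit_hours)


def reschedule(placement, active_systems):
    """Hypothetical rule, not the paper's algorithm: components stay on their
    system if it is still active; the rest migrate to the least-loaded one."""
    if not active_systems:
        raise ValueError("no active batch systems")
    load = {system: 0 for system in active_systems}
    new_placement = {}
    # First pass: retain components whose system is still active.
    for component, system in placement.items():
        if system in load:
            new_placement[component] = system
            load[system] += 1
    # Second pass: migrate components whose system is no longer active.
    for component in placement:
        if component not in new_placement:
            target = min(active_systems, key=lambda s: load[s])
            new_placement[component] = target
            load[target] += 1
    return new_placement


# Illustrative labels only: three components on three batch systems A, B, C.
placement = {"atmosphere": "A", "ocean": "B", "land": "C"}
# System C drops out of the active set, so "land" must migrate.
print(reschedule(placement, ["A", "B"]))
# A 100-hour run on a queue with a 24-hour limit needs several submissions.
print(submissions_needed(100, 24))
```

A real decision would also have to weigh queue wait times, migration cost, and the coupling between components, none of which this sketch models.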