In this paper, we address the challenge of reducing the time-to-solution of the data-intensive earthquake-simulation workflow CyberShake by supplementing the high-performance computing (HPC) resources on which it typically runs with distributed, heterogeneous resources obtained opportunistically from grids and clouds. We seek to minimize time-to-solution by maximizing the amount of work that can be done efficiently on the distributed resources. We identify data movement as the main bottleneck in effectively utilizing the combined local and distributed resources, and we address it by analyzing the I/O characteristics of the application, the processor acquisition rate (from a pilot-job service), and the data-movement throughput of the infrastructure. With these factors in mind, we explore a combination of strategies, including partitioning the computation between HPC and distributed resources and clustering jobs. We validate our approach with a theoretical study and with preliminary measurements on the Ranger HPC system and distributed Open Science Grid resources. More complete performance results will be presented in the final version of this paper.
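The partitioning trade-off described above can be illustrated with a toy model (a sketch under stated assumptions, not the paper's method; all function names, rates, and data sizes below are hypothetical): remote jobs are throttled by the slower of their aggregate compute rate and the wide-area data-movement throughput, so the makespan is minimized at the fraction of work where the local HPC side and the remote distributed side finish at the same time.

```python
# Hypothetical toy model of partitioning work between an HPC system and
# opportunistic distributed resources. All numbers are illustrative
# assumptions, not measurements from the paper.

def time_to_solution(frac_remote, n_jobs, hpc_rate, remote_rate,
                     data_per_job_gb, wan_gb_per_hour):
    """Makespan (hours) when frac_remote of n_jobs run remotely.

    Remote jobs pay a data-movement cost: their inputs must cross the
    wide-area link, which can dominate their compute time.
    """
    remote_jobs = frac_remote * n_jobs
    local_jobs = n_jobs - remote_jobs
    t_local = local_jobs / hpc_rate
    # The remote side is limited by the slower of compute and transfer.
    t_remote = max(remote_jobs / remote_rate,
                   remote_jobs * data_per_job_gb / wan_gb_per_hour)
    # Both sides run concurrently; the makespan is the slower side.
    return max(t_local, t_remote)

# Sweep the partitioning fraction in 1% steps to find the best split
# for 10,000 jobs, HPC at 200 jobs/h, remote at 150 jobs/h,
# 2 GB of input per job, and a 100 GB/h wide-area link.
best_frac = min((f / 100 for f in range(101)),
                key=lambda f: time_to_solution(f, 10_000, 200, 150,
                                               2.0, 100))
best_time = time_to_solution(best_frac, 10_000, 200, 150, 2.0, 100)
```

With these assumed numbers, running everything locally takes 50 hours and running everything remotely is transfer-bound at 200 hours, while offloading 20% of the jobs balances the two sides at a 40-hour makespan, which is why the paper's strategy hinges on measuring the infrastructure's actual throughput rather than its nominal compute capacity.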