Job and data clustering for aggregate use of multiple production cyberinfrastructures

Authors:
Ketan Maheshwari;Allan Espinosa;Daniel S. Katz;Michael Wilde;Zhao Zhang;Ian Foster;Scott Callaghan;Phillip Maechling
Affiliations:
Argonne National Laboratory, Argonne, IL, USA;University of Chicago, Chicago, USA;University of Chicago & Argonne National Laboratory, Chicago, USA;University of Chicago & Argonne National Laboratory, Chicago, USA;University of Chicago, Chicago, USA;Argonne National Laboratory, Argonne, USA;University of Southern California, Los Angeles, USA;University of Southern California, Los Angeles, USA
Venue:
Proceedings of the fifth international workshop on Data-Intensive Distributed Computing Date
Year:
2012

Abstract

In this paper, we address the challenge of reducing the time-to-solution of the data-intensive earthquake simulation workflow "CyberShake" by supplementing the high-performance parallel computing (HPC) resources on which it typically runs with distributed, heterogeneous resources that can be obtained opportunistically from grids and clouds. We seek to minimize time-to-solution by maximizing the amount of work that can be done efficiently on the distributed resources. We identify data movement as the main bottleneck in effectively utilizing the combined local and distributed resources. We address this bottleneck by analyzing the I/O characteristics of the application, the processor acquisition rate (from a pilot-job service), and the data movement throughput of the infrastructure. With these factors in mind, we explore a combination of strategies, including partitioning of computation (over HPC and distributed resources) and job clustering. We validate our approach with a theoretical study and with preliminary measurements on the Ranger HPC system and distributed Open Science Grid resources. More complete performance results will be presented in the final submission of this paper.
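
To make the trade-off between partitioning and job clustering concrete, the following is a minimal back-of-the-envelope sketch, not the model used in the paper. It assumes identical tasks, a fixed number of already-acquired remote processors (the pilot-job ramp-up is ignored), a single shared link for data movement, and a fixed per-job overhead that clustering amortizes. All function and parameter names are hypothetical and chosen only for illustration.

```python
import math


def time_to_solution(n_tasks, n_remote, t_task, hpc_procs, remote_procs,
                     mb_per_task, mb_per_s, job_overhead, cluster_size):
    """Estimate makespan (seconds) under a deliberately simple model.

    Assumptions (illustrative only, not taken from the paper):
      - every task runs for t_task seconds and needs mb_per_task MB of input;
      - HPC tasks find their data locally, remote tasks must have it moved
        over one shared link of mb_per_s MB/s;
      - each remote job pays job_overhead seconds once, so clustering
        cluster_size tasks into one job amortizes that cost.
    """
    n_local = n_tasks - n_remote

    # HPC side: tasks run in waves across the available processors.
    t_local = math.ceil(n_local / hpc_procs) * t_task

    # Distributed side: clustered jobs run in waves across remote processors.
    n_jobs = math.ceil(n_remote / cluster_size)
    t_job = job_overhead + cluster_size * t_task
    t_compute = math.ceil(n_jobs / remote_procs) * t_job

    # Data movement can overlap with compute, so the slower one dominates.
    t_transfer = n_remote * mb_per_task / mb_per_s
    t_remote = max(t_compute, t_transfer)

    # Both partitions run concurrently; the workflow ends with the slower one.
    return max(t_local, t_remote)


def best_split(n_tasks, step, **params):
    """Scan candidate partitions and return (n_remote, estimated makespan)."""
    candidates = range(0, n_tasks + 1, step)
    return min(((n, time_to_solution(n_tasks, n, **params)) for n in candidates),
               key=lambda pair: pair[1])
```

In this toy model, sending more tasks to the distributed resources helps only until either the transfer term or the clustered compute term exceeds the HPC-side time, which mirrors the abstract's point that data movement limits how much work can usefully be offloaded.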