Job and data clustering for aggregate use of multiple production cyberinfrastructures

  • Authors:
  • Ketan Maheshwari;Allan Espinosa;Daniel S. Katz;Michael Wilde;Zhao Zhang;Ian Foster;Scott Callaghan;Phillip Maechling

  • Affiliations:
  • Argonne National Laboratory, Argonne, IL, USA;University of Chicago, Chicago, USA;University of Chicago & Argonne National Laboratory, Chicago, USA;University of Chicago & Argonne National Laboratory, Chicago, USA;University of Chicago, Chicago, USA;Argonne National Laboratory, Argonne, USA;University of Southern California, Los Angeles, USA;University of Southern California, Los Angeles, USA

  • Venue:
  • Proceedings of the fifth international workshop on Data-Intensive Distributed Computing Date
  • Year:
  • 2012

Abstract

In this paper, we address the challenge of reducing the time-to-solution of the data-intensive earthquake simulation workflow "CyberShake" by supplementing the high-performance parallel computing (HPC) resources on which it typically runs with distributed, heterogeneous resources that can be obtained opportunistically from grids and clouds. We seek to minimize time-to-solution by maximizing the amount of work that can be done efficiently on the distributed resources. We identify data movement as the main bottleneck in effectively utilizing the combined local and distributed resources. We address this bottleneck by analyzing the I/O characteristics of the application, the processor acquisition rate (from a pilot-job service), and the data movement throughput of the infrastructure. With these factors in mind, we explore a combination of strategies, including partitioning of computation (over HPC and distributed resources) and job clustering. We validate our approach with a theoretical study and with preliminary measurements on the Ranger HPC system and distributed Open Science Grid resources. More complete performance results will be presented in the final submission of this paper.
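
The trade-off described in the abstract can be illustrated with a toy model. The sketch below is not the paper's model; it is a minimal illustration under simplifying assumptions: identical tasks, a single shared link to the distributed resources, data transfer fully overlapped with remote computation, and all remote processors available from the start (so the processor acquisition rate is ignored). Every function and parameter name (`makespan`, `best_split`, `cluster_size`, `job_overhead_s`, and so on) is hypothetical.

```python
import math


def makespan(n_remote, n_total, hpc_cores, remote_cores, task_s,
             mb_per_task, link_mb_s, cluster_size=1, job_overhead_s=0.0):
    """Estimated completion time when n_remote tasks run on distributed resources.

    Toy model only: identical tasks, one shared link, transfers overlapped
    with remote computation, all remote cores available immediately.
    """
    n_local = n_total - n_remote
    # HPC side: tasks run in waves of hpc_cores.
    hpc_time = math.ceil(n_local / hpc_cores) * task_s

    # Clustering packs cluster_size tasks into one job, so the per-job
    # overhead is paid once per cluster rather than once per task.
    jobs = math.ceil(n_remote / cluster_size)
    waves = math.ceil(jobs / remote_cores)
    # Upper bound: the last cluster may hold fewer than cluster_size tasks.
    compute_time = waves * (job_overhead_s + cluster_size * task_s)

    # All remote input data crosses the same link.
    transfer_time = n_remote * mb_per_task / link_mb_s

    remote_time = max(compute_time, transfer_time)
    return max(hpc_time, remote_time)


def best_split(n_total, **params):
    """Number of tasks to send to distributed resources (brute-force search)."""
    return min(range(n_total + 1),
               key=lambda r: makespan(r, n_total, **params))


if __name__ == "__main__":
    params = dict(hpc_cores=10, remote_cores=10, task_s=10.0,
                  mb_per_task=100.0, link_mb_s=20.0)
    r = best_split(100, **params)
    print("tasks sent to distributed resources:", r)
    print("estimated time-to-solution (s):", makespan(r, 100, **params))
```

In this model, lowering `link_mb_s` shrinks the share of work that is worth offloading, which mirrors the abstract's observation that data movement is the main bottleneck. Raising `cluster_size` reduces the total per-job overhead at the cost of coarser scheduling granularity.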