Data placement for scientific applications in distributed environments

  • Authors:
  • Ann Chervenak;Ewa Deelman;Miron Livny;Mei-Hui Su;Rob Schuler;Shishir Bharathi;Gaurang Mehta;Karan Vahi

  • Affiliations:
  • USC Information Sciences Institute, Marina Del Rey, CA 90292, USA;USC Information Sciences Institute, Marina Del Rey, CA 90292, USA;Computer Science Department, University of Wisconsin-Madison, WI 53706-1685, USA;USC Information Sciences Institute, Marina Del Rey, CA 90292, USA;USC Information Sciences Institute, Marina Del Rey, CA 90292, USA;USC Information Sciences Institute, Marina Del Rey, CA 90292, USA;USC Information Sciences Institute, Marina Del Rey, CA 90292, USA;USC Information Sciences Institute, Marina Del Rey, CA 90292, USA

  • Venue:
  • GRID '07 Proceedings of the 8th IEEE/ACM International Conference on Grid Computing
  • Year:
  • 2007

Abstract

Scientific applications often perform complex computational analyses that consume and produce large data sets. We are concerned with data placement policies that distribute data in ways that are advantageous for application execution, for example, by placing data sets so that they may be staged into or out of computations efficiently or by replicating them for improved performance and reliability. In particular, we propose to study the relationship between data placement services and workflow management systems. In this paper, we explore the interactions between two services used in large-scale science today. We evaluate the benefits of prestaging data using the Data Replication Service versus using the native data stage-in mechanisms of the Pegasus workflow management system. We use the astronomy application, Montage, for our experiments and modify it to study the effect of input data size on the benefits of data prestaging. As the size of input data sets increases, prestaging using a data placement service can significantly improve the performance of the overall analysis.