Data placement for scientific applications in distributed environments

Authors:
Ann Chervenak;Ewa Deelman;Miron Livny;Mei-Hui Su;Rob Schuler;Shishir Bharathi;Gaurang Mehta;Karan Vahi
Affiliations:
USC Information Sciences Institute, Marina Del Rey, CA 90292, USA;USC Information Sciences Institute, Marina Del Rey, CA 90292, USA;Computer Science Department, University of Wisconsin Madison, WI 53706-1685, USA;USC Information Sciences Institute, Marina Del Rey, CA 90292, USA;USC Information Sciences Institute, Marina Del Rey, CA 90292, USA;USC Information Sciences Institute, Marina Del Rey, CA 90292, USA;USC Information Sciences Institute, Marina Del Rey, CA 90292, USA;USC Information Sciences Institute, Marina Del Rey, CA 90292, USA
Venue:
GRID '07 Proceedings of the 8th IEEE/ACM International Conference on Grid Computing
Year:
2007

Abstract

Scientific applications often perform complex computational analyses that consume and produce large data sets. We are concerned with data placement policies that distribute data in ways that are advantageous for application execution, for example, by placing data sets so that they may be staged into or out of computations efficiently or by replicating them for improved performance and reliability. In particular, we propose to study the relationship between data placement services and workflow management systems. In this paper, we explore the interactions between two services used in large-scale science today. We evaluate the benefits of prestaging data using the Data Replication Service versus using the native data stage-in mechanisms of the Pegasus workflow management system. We use the astronomy application, Montage, for our experiments and modify it to study the effect of input data size on the benefits of data prestaging. As the size of input data sets increases, prestaging using a data placement service can significantly improve the performance of the overall analysis.