Bulk data transfer forecasts and the implications to grid scheduling

  • Authors:
  • Tobin Maginnis; Sudharshan Sankaran Vazhkudai

  • Year:
  • 2003

Abstract

Increasingly, scientific discovery is driven by computationally intensive analyses of massive data collections stored in a distributed fashion. Technology trends indicate that networks double in capacity or halve in price roughly every 8 months, and storage elements roughly every 12 months. This promising trend is propelling several scientific research groups to undertake projects never before foreseen; notable examples include high-energy physics experiments (CMS), studies of gravitational waves (LIGO), digital sky surveys (SDSS), and volunteer-computing efforts such as SETI@Home, Folding@Home, and Evolution@Home. The common denominator of these projects is their reliance on distributed data stores, which has thrust to the forefront the need to manage data access efficiently in massively distributed communities. Data Grids address this need by federating high-end storage systems and providing secure access to remote resources, resource discovery, data movement, and related services. This work addresses the efficient selection and access of replicated data in Grid environments in the context of the Globus Toolkit™, building middleware that (1) selects datasets in highly replicated environments, enabling efficient scheduling of data transfer requests; (2) predicts the duration of bulk wide-area data transfers using extensive statistical analysis; and (3) co-allocates bulk data transfer requests, enabling parallel downloads from mirrored sites. These efforts have demonstrated a decentralized data-scheduling architecture, a set of forecasting tools that predict bandwidth availability to within 15 percent error, and a co-allocation architecture and heuristics that expedite data downloads by 100 percent.
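
The abstract's forecasting tools derive transfer-time predictions from statistical analysis of past transfers. As a minimal sketch of that family of predictors (not the paper's actual tools), the Python below estimates a new transfer's duration from a sliding window of observed bandwidths; the log format, window size, and all function names are assumptions for illustration.

```python
from dataclasses import dataclass
from statistics import median

@dataclass
class TransferRecord:
    size_bytes: int    # bytes moved in a past bulk transfer
    seconds: float     # wall-clock duration of that transfer

def observed_bandwidth(rec: TransferRecord) -> float:
    """Achieved bandwidth of one completed transfer, in bytes/second."""
    return rec.size_bytes / rec.seconds

def forecast_seconds(history: list[TransferRecord], size_bytes: int,
                     window: int = 10, estimator=median) -> float:
    """Predict how long a transfer of size_bytes will take, using a
    point estimate (mean or median) of bandwidth over recent transfers."""
    recent = history[-window:]
    if not recent:
        raise ValueError("no transfer history to forecast from")
    predicted_bw = estimator(observed_bandwidth(r) for r in recent)
    return size_bytes / predicted_bw

# Example: two past transfers achieved roughly 10 MB/s, so a 500 MB
# file is forecast to take about 50 seconds.
history = [TransferRecord(1_000_000_000, 95.0),
           TransferRecord(2_000_000_000, 210.0)]
print(forecast_seconds(history, size_bytes=500_000_000))
```

Median-based estimators of this kind are robust to the occasional stalled transfer in the log, which is one reason point-estimate predictors can stay within a bounded error on noisy wide-area links.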
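Co-allocation, the third piece of middleware, splits one request across replicas so mirrored sites serve it in parallel. The sketch below shows one plausible heuristic in that spirit: a dynamic work queue in which each mirror claims the next unserved block as soon as it finishes its previous one, so faster mirrors naturally deliver more of the file. `fetch_block` is a hypothetical stand-in for a real partial-file transfer (e.g., a GridFTP partial get), and all names here are illustrative rather than the paper's implementation.

```python
import queue
import threading

def fetch_block(mirror: str, offset: int, length: int) -> bytes:
    # Hypothetical stand-in: a real implementation would issue a
    # partial-file transfer to `mirror` for this byte range.
    return bytes(length)

def coallocated_download(mirrors: list[str], total_size: int,
                         block_size: int = 4 * 1024 * 1024) -> bytearray:
    # Work queue of block offsets: mirrors claim blocks greedily, so a
    # fast mirror that finishes early immediately picks up more work.
    blocks = queue.Queue()
    for offset in range(0, total_size, block_size):
        blocks.put(offset)
    result = bytearray(total_size)

    def worker(mirror: str) -> None:
        while True:
            try:
                offset = blocks.get_nowait()
            except queue.Empty:
                return  # no blocks left; this mirror is done
            length = min(block_size, total_size - offset)
            result[offset:offset + length] = fetch_block(mirror, offset, length)

    threads = [threading.Thread(target=worker, args=(m,)) for m in mirrors]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return result

# Example: pull a 10 MB file from two mirrored sites in parallel.
data = coallocated_download(["gsiftp://mirror-a/f", "gsiftp://mirror-b/f"],
                            total_size=10 * 1024 * 1024)
```

Because blocks are assigned on demand rather than partitioned up front, a slow mirror cannot hold the whole download hostage, which is the kind of behavior that lets co-allocation heuristics approach the abstract's reported speedups.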