Data pipelines: enabling large scale multi-protocol data transfers

Authors:
Tevfik Kosar;George Kola;Miron Livny
Affiliations:
University of Wisconsin-Madison, Madison WI;University of Wisconsin-Madison, Madison WI;University of Wisconsin-Madison, Madison WI
Venue:
MGC '04 Proceedings of the 2nd workshop on Middleware for grid computing
Year:
2004

Abstract

Collaborating users need to move terabytes of data among their sites, often using multiple protocols. This process is fragile and requires considerable human intervention to deal with failures. In this work, we propose data pipelines, an automated system for transferring data among collaborating sites. It speaks multiple protocols, has sophisticated flow control, and recovers automatically from network, storage system, software, and hardware failures. We successfully used data pipelines to transfer three terabytes of DPOSS data from the SRB mass storage server at the San Diego Supercomputer Center to the UniTree mass storage system at NCSA. The whole transfer required no human intervention: the data pipeline recovered automatically from the network, storage system, software, and hardware failures that occurred along the way.
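
The automatic recovery described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it only assumes that a pipeline is a sequence of stages (for example a read from the source store, a wide-area transfer, and a write to the destination store), that each stage may fail transiently, and that a failed stage is retried with exponential backoff. The names `run_with_retry`, `run_pipeline`, and `make_flaky` are hypothetical and chosen for illustration.

```python
import time

def run_with_retry(step, name, max_attempts=5, base_delay=0.0, sleep=time.sleep):
    """Run one stage for one file, retrying with exponential backoff.

    Returns the number of attempts used; re-raises after the last attempt.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            step(name)
            return attempt
        except Exception:
            if attempt == max_attempts:
                raise
            # Assumed policy: double the wait after every failed attempt.
            sleep(base_delay * 2 ** (attempt - 1))

def run_pipeline(files, stages, max_attempts=5, base_delay=0.0):
    """Push each file through every stage in order.

    Returns the files that still failed after all retries, so one bad file
    does not stop the rest of the run.
    """
    failed = []
    for name in files:
        try:
            for step in stages:
                run_with_retry(step, name, max_attempts, base_delay)
        except Exception:
            failed.append(name)
    return failed

def make_flaky(failures_per_file, log):
    """Build a fake stage (hypothetical stand-in for a real protocol call)
    that fails a fixed number of times per file, then succeeds."""
    seen = {}
    def step(name):
        seen[name] = seen.get(name, 0) + 1
        if seen[name] <= failures_per_file:
            raise IOError(f"simulated failure moving {name}")
        log.append(name)
    return step
```

In this sketch a stage is any callable that raises on failure, so stages speaking different protocols can be chained without the driver knowing which protocol each one uses. Flow control, which the abstract also mentions, is left out here.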