Timely offloading of result-data in HPC centers

  • Authors:
  • Henry M. Monti (Virginia Polytechnic Institute and State University, Blacksburg, VA, USA)
  • Ali R. Butt (Virginia Polytechnic Institute and State University, Blacksburg, VA, USA)
  • Sudharshan S. Vazhkudai (Oak Ridge National Laboratory, Oak Ridge, TN, USA)

  • Venue:
  • Proceedings of the 22nd Annual International Conference on Supercomputing (ICS '08)
  • Year:
  • 2008


Abstract

High-performance computing is facing exponential growth in job output dataset sizes. This growth implies a significant commitment of supercomputing center resources, most notably precious scratch space, to handling data staging and offloading. However, the scratch area is typically managed using simple "purge policies," without the sophisticated "end-user data services" required to balance the center's resource consumption against user serviceability. End-user data services such as offloading are performed using point-to-point transfers that cannot reconcile the center's purge deadlines with users' delivery deadlines, cannot adapt to changing dynamics in the end-to-end data path, and are not fault-tolerant. We propose a robust framework for the timely, decentralized offload of result data, addressing these significant gaps in extant direct-transfer-based offloading. The decentralized offload is achieved using an overlay of user-specified intermediate nodes and well-known landmark nodes. These nodes serve both to provide multiple data-flow paths, thereby maximizing bandwidth, and to provide fail-over capabilities for the offload. We have implemented our techniques within a production job scheduler (PBS) and data transfer tool (BitTorrent), and our evaluation shows that offloading times can be significantly reduced (by 90.2% for a 2.1 GB file) while also meeting center-user Service Level Agreements.
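The overlay-based offload described above suggests a simple control loop: estimate bandwidth to each intermediate node, push data chunks along the currently best path, and re-queue a chunk when its transfer fails. The Python sketch below illustrates that idea only; the node names, chunk size, deadline, and the simulated bandwidth and transfer functions are hypothetical placeholders, not the paper's actual PBS/BitTorrent implementation.

```python
import random
import time

# Illustrative sketch of a deadline-aware, fault-tolerant offload loop in the
# spirit of the paper's overlay approach. All constants and the simulated
# measurement/transfer functions below are assumptions for demonstration.

INTERMEDIATE_NODES = ["nodeA.example.org", "nodeB.example.org", "nodeC.example.org"]
CHUNK_SIZE_MB = 64
DEADLINE_S = 300.0  # assumed SLA deadline derived from the center's purge policy

def measure_bandwidth(node: str) -> float:
    """Placeholder for a landmark-based bandwidth estimate (MB/s)."""
    return random.uniform(5.0, 50.0)

def push_chunk(node: str, chunk_id: int) -> bool:
    """Placeholder for pushing one chunk to an intermediate node.
    Returns False to model a transient node failure."""
    return random.random() > 0.1  # ~10% simulated failure rate

def offload(total_mb: int) -> None:
    chunks = list(range(total_mb // CHUNK_SIZE_MB))
    start = time.monotonic()
    while chunks:
        if time.monotonic() - start > DEADLINE_S:
            raise TimeoutError("offload missed the SLA deadline")
        # Re-rank nodes each round so the flow adapts to changing bandwidth.
        best = max(INTERMEDIATE_NODES, key=measure_bandwidth)
        chunk = chunks.pop(0)
        if not push_chunk(best, chunk):
            chunks.append(chunk)  # fail-over: retry the chunk on a later round
    print("offload complete")

if __name__ == "__main__":
    offload(total_mb=2048)  # roughly the 2.1 GB file from the evaluation
```

Re-ranking nodes on every round is what lets this toy flow adapt to changing end-to-end dynamics; the actual system instead exploits BitTorrent's swarming across multiple intermediate nodes to obtain both parallel data-flow paths and fail-over.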