High-performance computing is facing exponential growth in job output dataset sizes. This implies a significant commitment of supercomputing center resources, most notably precious scratch space, to handling data staging and offloading. However, the scratch area is typically managed using simple purge policies, without the sophisticated end-user data services needed to balance the center's resource consumption against user serviceability. End-user data services such as offloading are performed with point-to-point transfers that cannot reconcile the center's purge deadlines with users' delivery deadlines, cannot adapt to changing dynamics in the end-to-end data path, and are not fault-tolerant. We propose a robust framework for the timely, decentralized offload of result data that addresses these significant gaps in extant direct-transfer-based offloading. The decentralized offload is achieved using an overlay of user-specified intermediate nodes and well-known landmark nodes. These nodes both provide multiple data-flow paths, thereby maximizing bandwidth, and provide fail-over capabilities for the offload. We have implemented our techniques within a production job scheduler (PBS) and a data transfer tool (BitTorrent), and our evaluation shows that offloading times can be significantly reduced (by 90.2% for a 2.1 GB file) while also meeting center-user Service Level Agreements.
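The core idea (spreading result-data chunks over user-specified intermediate nodes, with fail-over to well-known landmark nodes, all bounded by the center's purge deadline) can be sketched roughly as below. This is a minimal illustrative sketch only: the node names, the `push_chunk` helper, and the chunk counts are assumptions made for demonstration, not the paper's PBS/BitTorrent-based implementation.

```python
import random
import time

# Hypothetical sketch: node names and the push_chunk helper are illustrative
# assumptions, not the paper's actual PBS/BitTorrent integration.

INTERMEDIATE_NODES = ["inter-a.example.org", "inter-b.example.org"]    # user-specified
LANDMARK_NODES = ["landmark-1.example.org", "landmark-2.example.org"]  # well-known

def push_chunk(chunk_id: int, node: str) -> bool:
    """Stand-in for a real point-to-point chunk transfer; simulates occasional failures."""
    return random.random() > 0.2

def offload(num_chunks: int, purge_deadline_s: float) -> float:
    """Spread chunks over intermediate nodes, failing over to landmark nodes."""
    start = time.time()
    for chunk_id in range(num_chunks):
        # Multiple data-flow paths: round-robin chunks across intermediate nodes.
        primary = INTERMEDIATE_NODES[chunk_id % len(INTERMEDIATE_NODES)]
        if not push_chunk(chunk_id, primary):
            # Fail-over: retry the chunk through the well-known landmark nodes.
            if not any(push_chunk(chunk_id, n) for n in LANDMARK_NODES):
                raise RuntimeError(f"chunk {chunk_id} could not be offloaded")
        if time.time() - start > purge_deadline_s:
            raise RuntimeError("purge deadline exceeded; offload SLA violated")
    return time.time() - start

if __name__ == "__main__":
    print(f"offload finished in {offload(64, purge_deadline_s=60.0):.2f}s")
```

In the system described in the abstract, the per-chunk transfers would be driven by the BitTorrent-based tool among the overlay nodes and scheduled through PBS, rather than by the sequential pushes shown in this sketch.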