Timely offloading of result-data in HPC centers

  • Authors:
  • Henry M. Monti (Virginia Polytechnic Institute and State University, Blacksburg, VA, USA)
  • Ali R. Butt (Virginia Polytechnic Institute and State University, Blacksburg, VA, USA)
  • Sudharshan S. Vazhkudai (Oak Ridge National Laboratory, Oak Ridge, TN, USA)

  • Venue:
  • Proceedings of the 22nd Annual International Conference on Supercomputing (ICS '08)
  • Year:
  • 2008


Abstract

High-performance computing is facing exponential growth in job output dataset sizes. This growth implies a significant commitment of supercomputing center resources, most notably precious scratch space, to handling data staging and offloading. However, the scratch area is typically managed using simple "purge policies," without the sophisticated "end-user data services" required to balance the center's resource consumption against user serviceability. End-user data services such as offloading are performed using point-to-point transfers that cannot reconcile the center's purge deadlines with users' delivery deadlines, cannot adapt to changing dynamics in the end-to-end data path, and are not fault-tolerant. We propose a robust framework for the timely, decentralized offload of result data, addressing these significant gaps in extant direct-transfer-based offloading. The decentralized offload is achieved using an overlay of user-specified intermediate nodes and well-known landmark nodes. These nodes serve both to provide multiple data-flow paths, thereby maximizing bandwidth, and to provide fail-over capabilities for the offload. We have implemented our techniques within a production job scheduler (PBS) and data transfer tool (BitTorrent), and our evaluation shows that offloading times can be significantly reduced (by 90.2% for a 2.1 GB file) while also meeting center-user Service Level Agreements.
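The overlay-based offload described above suggests a simple control loop: estimate bandwidth to each intermediate node, push data chunks along the currently best path, and re-queue a chunk when its transfer fails. The Python sketch below illustrates that idea only; the node names, chunk size, deadline, and the simulated bandwidth and transfer functions are hypothetical placeholders, not the paper's actual PBS/BitTorrent implementation.

```python
import random
import time

# Illustrative sketch of a deadline-aware, fault-tolerant offload loop in the
# spirit of the paper's overlay approach. All constants and the simulated
# measurement/transfer functions below are assumptions for demonstration.

INTERMEDIATE_NODES = ["nodeA.example.org", "nodeB.example.org", "nodeC.example.org"]
CHUNK_SIZE_MB = 64
DEADLINE_S = 300.0  # assumed SLA deadline derived from the center's purge policy

def measure_bandwidth(node: str) -> float:
    """Placeholder for a landmark-based bandwidth estimate (MB/s)."""
    return random.uniform(5.0, 50.0)

def push_chunk(node: str, chunk_id: int) -> bool:
    """Placeholder for pushing one chunk to an intermediate node.
    Returns False to model a transient node failure."""
    return random.random() > 0.1  # ~10% simulated failure rate

def offload(total_mb: int) -> None:
    chunks = list(range(total_mb // CHUNK_SIZE_MB))
    start = time.monotonic()
    while chunks:
        if time.monotonic() - start > DEADLINE_S:
            raise TimeoutError("offload missed the SLA deadline")
        # Re-rank nodes each round so the flow adapts to changing bandwidth.
        best = max(INTERMEDIATE_NODES, key=measure_bandwidth)
        chunk = chunks.pop(0)
        if not push_chunk(best, chunk):
            chunks.append(chunk)  # fail-over: retry the chunk on a later round
    print("offload complete")

if __name__ == "__main__":
    offload(total_mb=2048)  # roughly the 2.1 GB file from the evaluation
```

Re-ranking nodes on every round is what lets this toy flow adapt to changing end-to-end dynamics; the actual system instead exploits BitTorrent's swarming across multiple intermediate nodes to obtain both parallel data-flow paths and fail-over.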