/scratch as a cache: rethinking HPC center scratch storage

  • Authors:
  • Henry M. Monti;Ali R. Butt;Sudharshan S. Vazhkudai

  • Affiliations:
  • Virginia Tech., Blacksburg, VA, USA;Virginia Tech., Blacksburg, VA, USA;Oak Ridge National Laboratory, Oak Ridge, TN, USA

  • Venue:
  • Proceedings of the 23rd international conference on Supercomputing
  • Year:
  • 2009

Abstract

To sustain emerging data-intensive scientific applications, High Performance Computing (HPC) centers invest a notable fraction of their operating budget in a specialized fast storage system, scratch space, which is designed for storing the data of currently running and soon-to-run HPC jobs. Instead, it is often used as a standard file system, wherein users arbitrarily store their data without any consideration for the center's overall performance. To remedy this, centers periodically scan the scratch in an attempt to purge transient and stale data. This practice of supporting a cache workload using a file system and disjoint tools for staging and purging results in suboptimal use of the scratch space. In this paper, we address the above issues by proposing a new perspective, where the HPC scratch space is treated as a cache, and data population, retention, and eviction tools are integrated with scratch management. In our approach, data is moved to the scratch space only when it is needed, and unneeded data is removed as soon as possible. We also design a new job-workflow-aware caching policy that leverages user-supplied hints for managing the cache. Our evaluation using three-year job logs from the Jaguar supercomputer shows that, compared to the widely used purge approach, workflow-aware caching optimizes scratch utilization by reducing the average amount of data read by 9.3%, and by reducing job scheduling delays associated with data staging, on average, by 282.0%.
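The idea of treating scratch as a cache driven by user hints can be illustrated with a small toy model. The sketch below is not the paper's implementation or policy: the class `ScratchCache`, its methods, and the rule of evicting files with the fewest pending consumer jobs are all assumptions made for illustration. It only shows the two behaviors the abstract names: data is staged in when a job needs it, and data is removed as soon as no hinted job still needs it.

```python
from dataclasses import dataclass, field


@dataclass
class ScratchCache:
    """Toy model of scratch space managed as a cache (hypothetical names)."""
    capacity: int                               # total scratch size in bytes
    files: dict = field(default_factory=dict)   # file name -> size in bytes
    hints: dict = field(default_factory=dict)   # file name -> job ids that still need it

    def used(self) -> int:
        return sum(self.files.values())

    def stage_in(self, name: str, size: int, needed_by: set) -> list:
        """Stage a file for the given jobs; return the names of evicted files."""
        if size > self.capacity:
            raise ValueError("file larger than scratch")
        # Record the user-supplied hint: which jobs will consume this file.
        self.hints.setdefault(name, set()).update(needed_by)
        if name in self.files:
            return []  # already cached, nothing to stage
        evicted = []
        # Assumed policy: evict files with the fewest pending consumers first.
        victims = sorted(self.files, key=lambda f: (len(self.hints.get(f, ())), f))
        for victim in victims:
            if self.used() + size <= self.capacity:
                break
            evicted.append(victim)
            del self.files[victim]
        self.files[name] = size
        return evicted

    def job_finished(self, job_id: str) -> list:
        """Drop a finished job from all hints and remove files nobody needs."""
        removed = []
        for name in sorted(self.files):
            pending = self.hints.get(name, set())
            pending.discard(job_id)
            if not pending:
                removed.append(name)
        for name in removed:
            del self.files[name]
            self.hints.pop(name, None)
        return removed
```

In this toy model, a file shared by two jobs stays in scratch until both finish, while a file used by a single job is removed the moment that job completes, without waiting for a periodic purge scan.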