/scratch as a cache: rethinking HPC center scratch storage

Authors:
Henry M. Monti;Ali R. Butt;Sudharshan S. Vazhkudai
Affiliations:
Virginia Tech., Blacksburg, VA, USA;Virginia Tech., Blacksburg, VA, USA;Oak Ridge National Laboratory, Oak Ridge, TN, USA
Venue:
Proceedings of the 23rd international conference on Supercomputing
Year:
2009

Abstract

To sustain emerging data-intensive scientific applications, High Performance Computing (HPC) centers invest a notable fraction of their operating budget in a specialized fast storage system, the scratch space, which is designed to store the data of currently running and soon-to-run HPC jobs. In practice, however, it is often used as a standard file system in which users store data arbitrarily, without regard for the center's overall performance. To remedy this, centers periodically scan the scratch space in an attempt to purge transient and stale data. This practice of supporting a cache workload with a file system and disjoint tools for staging and purging results in suboptimal use of the scratch space. In this paper, we address these issues by proposing a new perspective in which the HPC scratch space is treated as a cache, and data population, retention, and eviction tools are integrated with scratch management. In our approach, data is moved to the scratch space only when it is needed, and unneeded data is removed as soon as possible. We also design a new job-workflow-aware caching policy that leverages user-supplied hints for managing the cache. Our evaluation, using three-year job logs from the Jaguar supercomputer, shows that, compared to the widely used purge approach, workflow-aware caching optimizes scratch utilization by reducing the average amount of data read by 9.3% and by reducing job scheduling delays associated with data staging by 282.0% on average.
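
The idea of treating scratch as a cache driven by user-supplied hints can be illustrated with a small sketch. This is not the policy from the paper, whose details the abstract does not give. It is a toy model under stated assumptions: each queued job declares its input files as a hint, data is staged only on demand, and when space runs out the cache first evicts files that no queued job has declared, then the file whose next declared use is furthest back in the queue. The class `ScratchCache` and its methods are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class ScratchCache:
    """Toy model of scratch space managed as a cache (illustrative only)."""
    capacity: int                               # total capacity, arbitrary units
    files: dict = field(default_factory=dict)   # file name -> size, for resident files

    def used(self):
        """Space currently occupied by resident files."""
        return sum(self.files.values())

    def next_use(self, name, queue):
        """Index of the first queued job whose hints list `name`, or None."""
        for position, job_inputs in enumerate(queue):
            if name in job_inputs:
                return position
        return None

    def stage(self, name, size, queue):
        """Stage `name` on demand, evicting by hinted future use if needed.

        `queue` is a list of sets: the input files each queued job declared
        (an assumed hint format). Returns the names of evicted files.
        """
        if name in self.files:
            return []                  # already resident: nothing to stage
        if size > self.capacity:
            raise ValueError("file larger than scratch capacity")

        def priority(candidate):
            # Files no queued job needs rank highest for eviction;
            # otherwise, the later the next declared use, the higher the rank.
            position = self.next_use(candidate, queue)
            return (position is None, position if position is not None else 0)

        evicted = []
        while self.used() + size > self.capacity:
            victim = max(self.files, key=priority)
            del self.files[victim]
            evicted.append(victim)
        self.files[name] = size
        return evicted
```

A periodic purge would instead remove files by age regardless of what queued jobs need. In this sketch, a file declared by a job near the head of the queue is kept in preference to one that no queued job has declared.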