Recovering transient data: automated on-demand data reconstruction and offloading for supercomputers

Authors:
Sudharshan Vazhkudai; Xiaosong Ma
Affiliations:
Oak Ridge National Laboratory; North Carolina State University and Oak Ridge National Laboratory
Venue:
ACM SIGOPS Operating Systems Review
Year:
2007

Abstract

It has become a national priority to build and use PetaFlop supercomputers. The dependability of such large systems has been recognized as a key issue that can impact their usability. Even on smaller, existing machines, failures are the norm rather than the exception. Research has shown that storage systems are the primary source of faults leading to supercomputer unavailability. In this paper, we envision two mechanisms, on-demand data reconstruction and eager data offloading, to address the availability of job input/output data. Both techniques aim to allow parallel jobs and post-job processing tools to continue execution despite storage system failures in supercomputers. Fundamental to both approaches is the definition and acquisition of recovery-related parallel file system metadata, which is then coupled with transparent remote data access. Our approach attempts to maximize the utilization of precious supercomputer resources by improving the accessibility of transient job data. Further, the proposed methods are best-effort in nature and complement existing file system recovery schemes, which are designed for persistent data. Several of our previous studies help demonstrate the feasibility of the proposed approaches.