Optimizing center performance through coordinated data staging, scheduling and recovery

Authors:
Zhe Zhang;Chao Wang;Sudharshan S. Vazhkudai;Xiaosong Ma;Gregory G. Pike;John W. Cobb;Frank Mueller
Affiliations:
North Carolina State University;North Carolina State University;Oak Ridge National Laboratory;North Carolina State University and Oak Ridge National Laboratory;Oak Ridge National Laboratory;Oak Ridge National Laboratory;North Carolina State University
Venue:
Proceedings of the 2007 ACM/IEEE conference on Supercomputing
Year:
2007

Abstract

Procurement and optimized utilization of petascale supercomputers and centers are a renewed national priority. Sustaining the performance and availability of such large centers is a key technical challenge that significantly affects their usability. Storage systems are known to be the primary source of faults, leading to data unavailability and job resubmissions. The result is reduced center performance, caused in part by the lack of coordination between I/O activities and job scheduling. In this work, we propose coordinating job scheduling with data staging/offloading and with on-demand reconstruction of staged data, to address the availability of job input data and to improve center-wide performance. Fundamental to both mechanisms is the efficient management of transient data: how it is scheduled and how it is recovered. Collectively, from a center's standpoint, these techniques optimize resource usage and increase data and service availability. From a user's standpoint, they reduce job turnaround time and make better use of allocated time.