Coupling prefix caching and collective downloads for remote dataset access

Authors:
Xiaosong Ma;Vincent W. Freeh;Tao Yang;Sudharshan S. Vazhkudai;Tyler A. Simon;Stephen L. Scott
Affiliations:
North Carolina State University, Raleigh, NC;North Carolina State University, Raleigh, NC;North Carolina State University, Raleigh, NC;Oak Ridge National Laboratory, Oak Ridge, TN;Oak Ridge National Laboratory, Oak Ridge, TN;Oak Ridge National Laboratory, Oak Ridge, TN
Venue:
Proceedings of the 20th annual international conference on Supercomputing
Year:
2006

Abstract

Scientific datasets are typically archived at mass storage systems or data centers close to supercomputers/instruments. End-users of these datasets, however, usually perform parts of their workflows at their local computers. In such cases, client-side caching can offer significant gains by reducing the cost of wide-area data movement. Scientific data caches, however, traditionally cache entire datasets, which may not be necessary. In this paper, we propose a novel combination of prefix caching and collective download. Prefix caching allows the bootstrapping of dataset downloads by caching only a prefix of the dataset, while collective download facilitates efficient parallel patching of the missing suffix from an external data source. To estimate the optimal prefix size, we further present an analytical model that considers both the initial download overhead and the downloading speed. We implemented our proposed approach in the FreeLoader distributed cache prototype. Experimental results (using multiple scientific data repositories and data transfer tools, as well as a real-world scientific dataset access trace) demonstrate that prefix caching and collective download can be implemented efficiently, our model can select an appropriate prefix size, and the cache hit rate can be improved significantly without hurting the local access rate of cached datasets.
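
The abstract does not give the analytical model itself, so the following is only an illustrative sketch of how a prefix size might be derived from the two quantities it names (initial download overhead and downloading speed). It is not the paper's model. It assumes a client that reads the dataset sequentially at a constant local rate, a suffix download that begins after a fixed startup delay and then proceeds at a constant rate, and it picks the smallest prefix for which the client never stalls waiting for suffix bytes. The function name `min_prefix_bytes` and all parameter names are hypothetical.

```python
def min_prefix_bytes(dataset_bytes, local_rate, remote_rate, startup_delay):
    """Smallest cached prefix (bytes) that lets a sequential reader avoid stalls.

    Illustrative assumptions (not taken from the paper):
      - the client consumes data at a constant `local_rate` (bytes/s);
      - the suffix download starts at time 0, delivers nothing for
        `startup_delay` seconds, then delivers `remote_rate` bytes/s;
      - the byte at offset x must have arrived by time x / local_rate.
    """
    if dataset_bytes < 0 or local_rate <= 0 or remote_rate <= 0 or startup_delay < 0:
        raise ValueError("sizes and delay must be non-negative, rates positive")

    # The no-stall condition is linear in the offset x, so it only needs to
    # hold at the two ends of the suffix.
    # At the first suffix byte: the prefix must outlast the startup delay.
    cover_startup = startup_delay * local_rate
    # At the last byte: the whole suffix must finish downloading by the time
    # the reader reaches the end of the dataset.
    read_time = dataset_bytes / local_rate
    cover_tail = dataset_bytes - remote_rate * (read_time - startup_delay)

    needed = max(cover_startup, cover_tail)
    # The prefix cannot be negative or larger than the dataset itself.
    return min(max(needed, 0.0), float(dataset_bytes))


if __name__ == "__main__":
    # Slow remote source: most of the dataset has to be cached.
    print(min_prefix_bytes(1000, local_rate=100, remote_rate=50, startup_delay=2))
    # Fast remote source: the prefix only has to hide the startup delay.
    print(min_prefix_bytes(1000, local_rate=100, remote_rate=200, startup_delay=2))
```

Under these assumptions, a remote source slower than the local access rate forces a prefix that grows with dataset size, while a faster source needs only enough prefix to mask the startup delay. A collective download that aggregates bandwidth across several clients would, in this sketch, simply appear as a larger `remote_rate`.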