File caching in data intensive scientific applications on data-grids

Authors:
Ekow Otoo;Doron Rotem;Alexandru Romosan;Sridhar Seshadri
Affiliations:
Lawrence Berkeley National Laboratory, University of California, Berkeley, California;Lawrence Berkeley National Laboratory, University of California, Berkeley, California;Lawrence Berkeley National Laboratory, University of California, Berkeley, California;Leonard N. Stern School of Business, New York University, New York
Venue:
DMG 2005 Proceedings of the First VLDB conference on Data Management in Grids
Year:
2005

Abstract

We present theoretical and experimental results for an important caching problem that arises frequently in data-intensive scientific applications run on data grids. Such applications often need to process several files simultaneously; that is, the application can run only if all the files it needs are present in a disk cache accessible to its compute resource. The set of files requested by an application, all of which must be in cache for the application to run, is called a file-bundle. This requirement introduces the need for cache replacement algorithms that are based on file-bundles rather than individual files. We show that traditional caching algorithms such as Least Recently Used (LRU) and GreedyDual-Size (GDS) are not optimal in this case, since they are not sensitive to file-bundles and may retain irrelevant combinations of files in the cache. We propose and analyze a new cache replacement algorithm specifically adapted to deal with file-bundles. Experimental studies of the new algorithm, using a disk cache simulation model under a wide range of conditions (file request distribution, relative cache size, file size distribution, and incoming job queue size), show significant improvement over traditional caching algorithms such as GDS.
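To make the file-bundle condition concrete, the following minimal Python sketch shows a per-file LRU cache that is unaware of bundles. It is an illustration only, not the algorithm proposed in the paper: the names `bundle_hit` and `LRUFileCache` are hypothetical, and every file is assumed to have unit size.

```python
from collections import OrderedDict


def bundle_hit(cached_files, bundle):
    """A job can run only if every file in its bundle is already cached."""
    return all(f in cached_files for f in bundle)


class LRUFileCache:
    """Per-file LRU cache that ignores bundles (hypothetical, unit-size files)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.files = OrderedDict()  # oldest entry first, newest last

    def request(self, bundle):
        """Serve one job; return True only if the whole bundle was cached."""
        hit = bundle_hit(self.files, bundle)
        # Refresh the files that are already cached so they are not
        # evicted while the rest of the same bundle is being loaded.
        for f in bundle:
            if f in self.files:
                self.files.move_to_end(f)
        # Load the missing files, evicting the least recently used file
        # whenever the cache is full.
        for f in bundle:
            if f not in self.files:
                if len(self.files) >= self.capacity:
                    self.files.popitem(last=False)
                self.files[f] = True
        return hit
```

With a capacity of three files, requesting the bundle `("a", "b")` twice and then `("c", "d")` evicts only `a`. A further request for `("a", "b")` is then a bundle miss even though `b` is still cached, and afterwards the cache still holds `d` without its partner `c`. This is the kind of file combination that a bundle-aware replacement policy would try to avoid keeping.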