File grouping for scientific data management: lessons from experimenting with real traces

Authors:
Shyamala Doraimani;Adriana Iamnitchi
Affiliations:
University of South Florida, Tampa, FL, USA;University of South Florida, Tampa, FL, USA
Venue:
HPDC '08 Proceedings of the 17th international symposium on High performance distributed computing
Year:
2008

Abstract

Analysis of data usage in a large set of real traces from a high-energy physics collaboration revealed an emergent grouping of files, which we call "filecules". This paper presents the benefits of using this file grouping for prestaging data and compares it with previously proposed file grouping techniques across a range of performance metrics. Our experiments with real workloads demonstrate that filecule grouping is a reliable and useful abstraction for data management in science Grids; that preserving time locality for data prestaging is highly recommended; that reordering jobs according to data availability has a significant impact on throughput; and, finally, that a relatively short history of traces is a good predictor of filecule grouping. Our experimental results provide lessons for workload modeling and suggest design guidelines for data management in data-intensive resource-sharing environments.