File-access patterns of data-intensive workflow applications and their implications to distributed filesystems

Authors:
Takeshi Shibata; SungJun Choi; Kenjiro Taura
Affiliations:
University of Tokyo; University of Tokyo; University of Tokyo
Venue:
Proceedings of the 19th ACM International Symposium on High Performance Distributed Computing
Year:
2010

Abstract

This paper studies five real-world data-intensive workflow applications in the fields of natural language processing, astronomy image analysis, and web data analysis. Data-intensive workflows are becoming increasingly important applications for cluster and Grid environments. They pose new challenges to various components of workflow execution environments, including job dispatchers, schedulers, file systems, and file staging tools. The keys to achieving high performance are efficient data sharing among executing hosts and locality-aware scheduling that reduces the amount of data transfer. While much work has been done on scheduling workflows, much of it uses synthetic or random workloads, so its impact on real workloads is largely unknown. Understanding the characteristics of real-world workflow applications is a necessary step to promote research in this area. To this end, we analyse real-world workflow applications, focusing on their file access patterns, and summarize the implications for scheduler and file system/staging designs.