This paper studies five real-world data-intensive workflow applications in the fields of natural language processing, astronomy image analysis, and web data analysis. Data-intensive workflows are becoming increasingly important applications for cluster and Grid environments. They pose new challenges for various components of workflow execution environments, including job dispatchers, schedulers, file systems, and file staging tools. The keys to achieving high performance are efficient data sharing among executing hosts and locality-aware scheduling that reduces the amount of data transfer. While much work has been done on scheduling workflows, much of it uses synthetic or random workloads, so its impact on real workloads is largely unknown. Understanding the characteristics of real-world workflow applications is a required step toward promoting research in this area. To this end, we analyse real-world workflow applications, focusing on their file access patterns, and summarize the implications for scheduler and file system/staging designs.