Many scientific workflows are data intensive: large volumes of intermediate data are generated during their execution, and some of these data are valuable enough to be stored for sharing or reuse. Traditionally, which intermediate data to keep is decided manually, constrained by the system's storage capacity. As doing science in the cloud becomes popular, more intermediate data can be stored in scientific cloud workflows under a pay-for-use model. In this paper, we build an intermediate data dependency graph (IDG) from the data provenance in scientific workflows. With the IDG, deleted intermediate data can be regenerated, and on this basis we develop a novel intermediate data storage strategy that reduces the cost of scientific cloud workflow systems by automatically storing the appropriate intermediate data sets with one cloud service provider. The strategy achieves a cost-effective trade-off between computation cost and storage cost, and it is not strongly affected by inaccurate forecasts of data sets' usage. It also takes users' tolerance of data access delay into consideration. We utilize Amazon's cost model and apply the strategy both to general random workflows and to a specific astrophysics pulsar searching scientific workflow for evaluation. The results show that our strategy can significantly reduce the overall cost of scientific cloud workflow execution. Copyright © 2010 John Wiley & Sons, Ltd. (A preliminary version of this paper was published in the proceedings of IPDPS'2010, Atlanta, U.S.A., April 2010.)
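The core trade-off described above can be sketched as a per-dataset decision: keep a dataset if its ongoing storage cost is no higher than the expected cost of regenerating it from its predecessors in the IDG whenever it is needed. The following Python sketch is illustrative only; the function names, the storage price, and the simple per-dataset rule are assumptions for exposition, not the paper's exact algorithm (which reasons over the whole dependency graph).

```python
# Illustrative sketch of the storage-vs-regeneration trade-off.
# All names and prices here are hypothetical examples, not the paper's model.

def storage_cost_rate(size_gb, price_per_gb_month):
    """Monthly cost of keeping a dataset stored in the cloud."""
    return size_gb * price_per_gb_month

def regeneration_cost_rate(compute_cost, usage_rate_per_month):
    """Expected monthly cost of re-deriving a deleted dataset on demand,
    assuming its provenance (the IDG) lets us rerun the generating tasks."""
    return compute_cost * usage_rate_per_month

def should_store(size_gb, compute_cost, usage_rate_per_month,
                 price_per_gb_month=0.023):  # assumed S3-like storage price
    """Keep the dataset only if storing it is no dearer than regenerating it."""
    return (storage_cost_rate(size_gb, price_per_gb_month)
            <= regeneration_cost_rate(compute_cost, usage_rate_per_month))

# A large, rarely used, cheap-to-regenerate dataset is deleted,
# while a small, frequently used, expensive one is kept.
print(should_store(size_gb=500, compute_cost=0.50, usage_rate_per_month=1))
print(should_store(size_gb=50, compute_cost=20.0, usage_rate_per_month=5))
```

In this toy rule, deleting a dataset shifts cost from storage to computation; the paper's strategy additionally accounts for chains of deleted datasets in the IDG, since regenerating one dataset may first require regenerating its deleted ancestors.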