A data dependency based strategy for intermediate data storage in scientific cloud workflow systems

  • Authors:
  • Dong Yuan, Yun Yang, Xiao Liu, Gaofeng Zhang, Jinjun Chen

  • Affiliations:
  • Faculty of Information and Communication Technologies, Swinburne University of Technology, Hawthorn, Melbourne, Vic. 3122, Australia (all authors)

  • Venue:
  • Concurrency and Computation: Practice & Experience
  • Year:
  • 2012


Abstract

Many scientific workflows are data-intensive: large volumes of intermediate data are generated during their execution. Some valuable intermediate data need to be stored for sharing or reuse. Traditionally, such data are selected for storage manually, according to the system's storage capacity. As doing science in the cloud has become popular, more intermediate data can be stored in scientific cloud workflows based on a pay-for-use model. In this paper, we build an intermediate data dependency graph (IDG) from the data provenance in scientific workflows. With the IDG, deleted intermediate data can be regenerated, and on this basis we develop a novel intermediate data storage strategy that reduces the cost of scientific cloud workflow systems by automatically storing appropriate intermediate data sets with one cloud service provider. The strategy has significant research merits: it achieves a cost-effective trade-off between computation cost and storage cost, and it is not strongly impacted by inaccurate forecasts of data set usage. The strategy also takes users' tolerance of data access delay into consideration. We use Amazon's cost model and apply the strategy to both general random workflows and a specific astrophysics pulsar searching workflow for evaluation. The results show that our strategy can significantly reduce the overall cost of scientific cloud workflow execution. Copyright © 2010 John Wiley & Sons, Ltd. (A preliminary version of this paper was published in the proceedings of IPDPS'2010, Atlanta, U.S.A., April 2010.)
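The core trade-off the abstract describes, deleting an intermediate data set when storing it costs more than regenerating it along the IDG, can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's actual algorithm: the dataset names, the flat $0.15/GB/month storage rate, and the greedy per-set decision rule are all hypothetical, and the real strategy additionally weighs users' tolerance of data access delay.

```python
from dataclasses import dataclass, field
from typing import List

STORAGE_PRICE = 0.15  # $/GB/month -- illustrative, S3-style flat rate

@dataclass
class DataSet:
    """One node of the intermediate data dependency graph (IDG).
    All attribute names are illustrative, not taken from the paper."""
    name: str
    size_gb: float                  # storage footprint
    gen_cost: float                 # $ to recompute this set from its predecessors
    usage_per_month: float          # forecast accesses per month
    predecessors: List["DataSet"] = field(default_factory=list)
    stored: bool = True

def regeneration_cost(ds: DataSet) -> float:
    """Regenerating a deleted set costs its own computation plus that of
    every deleted ancestor, walking the IDG back to the nearest stored sets."""
    return ds.gen_cost + sum(
        regeneration_cost(p) for p in ds.predecessors if not p.stored
    )

def decide_storage(ds: DataSet) -> bool:
    """Greedy per-set rule: keep ds iff its monthly storage cost does not
    exceed the expected monthly cost of regenerating it on demand."""
    storage = ds.size_gb * STORAGE_PRICE
    regen = ds.usage_per_month * regeneration_cost(ds)
    ds.stored = storage <= regen
    return ds.stored

# Decide in dependency order, so each set sees its predecessors' decisions.
raw = DataSet("raw_signal", size_gb=200, gen_cost=50, usage_per_month=0.1)
derived = DataSet("candidates", size_gb=2, gen_cost=5, usage_per_month=3,
                  predecessors=[raw])
print(decide_storage(raw))      # False: rarely used, costly to keep
print(decide_storage(derived))  # True: frequently used, cheap to store
```

Deciding in dependency order matters because deleting a predecessor raises the regeneration cost of every set downstream of it, which is exactly why `derived` ends up stored here even though its own computation is cheap.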