Many scientific workflows are data intensive: large volumes of intermediate datasets are generated during their execution. Some valuable intermediate datasets need to be stored for sharing or reuse. Traditionally, they are selectively stored according to the system's storage capacity, with the selection made manually. As doing science on clouds has become popular, more intermediate datasets in scientific cloud workflows can be stored under different storage strategies based on a pay-as-you-go model. In this paper, we build an intermediate data dependency graph (IDG) from the data provenance in scientific workflows. With the IDG, deleted intermediate datasets can be regenerated, and on this basis we develop a novel algorithm that finds a minimum cost storage strategy for the intermediate datasets in scientific cloud workflow systems. The strategy achieves the best trade-off between computation cost and storage cost by automatically storing the most appropriate intermediate datasets in cloud storage. It can be utilised on demand as a minimum cost benchmark for all other intermediate dataset storage strategies in the cloud. We adopt Amazon's cloud cost model and apply the algorithm to both general random workflows and a specific astrophysics pulsar searching scientific workflow for evaluation. The results show that the benchmark effectively demonstrates cost savings over other representative storage strategies.
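The core trade-off the abstract describes can be sketched as a per-dataset decision rule: keep a dataset in cloud storage only if its ongoing storage cost rate is below the expected cost of regenerating it for each access. This is a deliberately simplified, hypothetical illustration — the paper's actual algorithm finds a globally minimum cost subset over the IDG, since deleting one dataset raises the regeneration cost of its descendants — and the price and dataset values below are invented for demonstration only.

```python
from dataclasses import dataclass

# Illustrative per-GB-month storage price, in the general range of
# public cloud object storage pricing (assumed, not from the paper).
STORAGE_PRICE = 0.15  # $/GB/month

@dataclass
class Dataset:
    name: str
    size_gb: float     # size of the intermediate dataset
    regen_cost: float  # $ to recompute it once from stored ancestors
    usage_rate: float  # expected accesses per month

def should_store(d: Dataset) -> bool:
    """Store a dataset only if keeping it is cheaper per month than
    regenerating it on every expected access.

    Local rule only: the paper's minimum cost strategy additionally
    accounts for dependencies between datasets in the IDG.
    """
    storage_cost_rate = d.size_gb * STORAGE_PRICE        # $/month to keep
    regeneration_cost_rate = d.regen_cost * d.usage_rate  # $/month to recompute
    return storage_cost_rate < regeneration_cost_rate

# Hypothetical intermediate datasets from a workflow run.
demo = [
    Dataset("intermediate_d1", size_gb=20.0, regen_cost=4.0, usage_rate=2.0),
    Dataset("intermediate_d2", size_gb=300.0, regen_cost=1.0, usage_rate=0.5),
]
for d in demo:
    print(d.name, "-> store" if should_store(d) else "-> regenerate on demand")
```

Under these made-up numbers, the small, frequently reused dataset is stored ($3/month to keep vs. $8/month to recompute), while the large, rarely used one is deleted and regenerated on demand — the pay-as-you-go trade-off the benchmark formalises.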