Differencing Provenance in Scientific Workflows

Authors:
Zhuowei Bao; Sarah Cohen-Boulakia; Susan B. Davidson; Anat Eyal; Sanjeev Khanna
Venue:
ICDE '09 Proceedings of the 2009 IEEE International Conference on Data Engineering
Year:
2009

Citing 0
Cited 13

BioBrowsing: Making the Most of the Data Available in Entrez
SSDBM 2009 Proceedings of the 21st International Conference on Scientific and Statistical Database Management

PDiffView: viewing the difference in provenance of workflow results
Proceedings of the VLDB Endowment

Techniques for efficiently querying scientific workflow provenance graphs
Proceedings of the 13th International Conference on Extending Database Technology

Fine-grained and efficient lineage querying of collection-based workflow provenance
Proceedings of the 13th International Conference on Extending Database Technology

Efficiently supporting secure and reliable collaboration in scientific workflows
Journal of Computer and System Sciences

On-demand minimum cost benchmarking for intermediate dataset storage in scientific cloud workflow systems
Journal of Parallel and Distributed Computing

Searching workflows with hierarchical views
Proceedings of the VLDB Endowment

The Foundations for Provenance on the Web
Foundations and Trends in Web Science

Search, adapt, and reuse: the future of scientific workflows
ACM SIGMOD Record

A data dependency based strategy for intermediate data storage in scientific cloud workflow systems
Concurrency and Computation: Practice & Experience

Towards semantic comparison of multi-granularity process traces
Knowledge-Based Systems

Efficient recovery of missing events
Proceedings of the VLDB Endowment

Editorial: OPQL: Querying scientific workflow provenance at the graph level
Data & Knowledge Engineering

Abstract

Scientific workflow management systems are increasingly providing the ability to manage and query the provenance of data products. However, the problem of differencing the provenance of two data products produced by executions of the same specification has not been adequately addressed. Although this problem is NP-hard for general workflow specifications, an analysis of real scientific (and business) workflows shows that their specifications can be captured as series-parallel graphs overlaid with well-nested forking and looping. For this natural restriction, we present efficient, polynomial-time algorithms for differencing executions of the same specification and thereby understanding the difference in the provenance of their data products. We then describe a prototype called PDiffView built around our differencing algorithm. Experimental results demonstrate the scalability of our approach using collected, real workflows and increasingly complex runs.
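
To make the idea of differencing two runs of one series-parallel specification more concrete, the following is a minimal illustrative sketch and not the algorithm from the paper. It assumes a run can be written as a tree whose internal nodes are either "series" (ordered steps) or "fork" (an unordered set of copies of one sub-workflow), and it measures the difference as the number of module executions that must be inserted or deleted. The names Node, size and diff_cost are hypothetical, and the fork matching is done by brute force, which is exponential in the number of copies; a polynomial-time variant would presumably replace it with a minimum-cost bipartite matching.

```python
from dataclasses import dataclass
from itertools import permutations


@dataclass(frozen=True)
class Node:
    """One node of a run tree (hypothetical representation)."""
    kind: str             # "leaf", "series", or "fork"
    label: str = ""       # name of the module or sub-workflow in the specification
    children: tuple = ()  # ordered for "series", unordered copies for "fork"


def size(n):
    """Number of leaf nodes (module executions) under n."""
    if n.kind == "leaf":
        return 1
    return sum(size(c) for c in n.children)


def diff_cost(a, b):
    """Leaf insertions plus deletions needed to turn run a into run b."""
    if a.kind != b.kind or a.label != b.label:
        return size(a) + size(b)          # nothing in common: delete a, insert b
    if a.kind == "leaf":
        return 0
    if a.kind == "series":
        if len(a.children) != len(b.children):
            return size(a) + size(b)
        # Series steps are compared position by position.
        return sum(diff_cost(x, y) for x, y in zip(a.children, b.children))
    # Fork: copies are unordered, so try every way of pairing them up.
    xs, ys = list(a.children), list(b.children)
    if len(xs) > len(ys):
        xs, ys = ys, xs                   # the cost is symmetric, so swapping is safe
    total_ys = sum(size(y) for y in ys)
    best = None
    for perm in permutations(ys, len(xs)):
        matched = sum(diff_cost(x, y) for x, y in zip(xs, perm))
        unmatched = total_ys - sum(size(y) for y in perm)
        cost = matched + unmatched
        if best is None or cost < best:
            best = cost
    return best


def leaf(label):
    return Node("leaf", label)


# One copy of the forked sub-workflow B in run1, two copies in run2.
copy_b = Node("series", "B", (leaf("b1"), leaf("b2")))
run1 = Node("series", "S", (leaf("A"), Node("fork", "F", (copy_b,)), leaf("C")))
run2 = Node("series", "S", (leaf("A"), Node("fork", "F", (copy_b, copy_b)), leaf("C")))

print(diff_cost(run1, run2))  # prints 2: the extra copy of B adds two module executions
```

The two example runs differ only in how many times the forked sub-workflow executed, so the reported difference is exactly the size of the extra copy.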