Techniques for efficiently querying scientific workflow provenance graphs

Authors:
Manish Kumar Anand;Shawn Bowers;Bertram Ludäscher
Affiliations:
University of California, Davis;Gonzaga University;University of California, Davis
Venue:
Proceedings of the 13th International Conference on Extending Database Technology
Year:
2010


Abstract

A key advantage of scientific workflow systems over traditional scripting approaches is their ability to automatically record data and process dependencies introduced during workflow runs. This information is often represented through provenance graphs, which scientists can use to better understand, reproduce, and verify scientific results. However, while most systems record and store data and process dependencies, few provide easy-to-use and efficient approaches for accessing and querying provenance information. Instead, users formulate provenance graph queries directly against physical data representations (e.g., relational, XML, or RDF), leading to queries that are difficult to express and expensive to evaluate. We address these problems through a high-level query language tailored for expressing provenance graph queries. The language is based on a general model of provenance supporting scientific workflows that process XML data and employ update semantics. Query constructs are provided for querying both structure and lineage information. Unlike other languages that return sets of nodes as answers, our query language is closed, i.e., answers to lineage queries are sets of lineage dependencies (edges), allowing answers to be further queried. We provide a formal semantics for the language and present novel techniques for efficiently evaluating lineage queries. Experimental results on real and synthetic provenance traces demonstrate that our lineage-based optimizations outperform both an in-memory implementation and a standard database implementation by orders of magnitude. We also show that our strategies are feasible and can significantly reduce both provenance storage size and query execution time when compared with standard approaches.
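
The closure property described in the abstract can be illustrated with a small sketch. It is not the paper's query language or its evaluation technique. It is a plain in-memory traversal, written under the assumption that a provenance graph is a set of (source, dependent) edge pairs. The function name `lineage_edges`, the node labels `d1` to `d6`, and the sample graph are all hypothetical. The point is only that a lineage query returning edges, rather than nodes, produces an answer that is itself a graph and can be queried again.

```python
from collections import defaultdict


def lineage_edges(edges, target):
    """Return every dependency edge on some path that ends at `target`.

    `edges` is an iterable of (source, dependent) pairs (an assumed
    representation, not the paper's storage model). The answer is a set
    of edges, so it can be passed back in as a new graph.
    """
    # Index each node's direct sources for backward traversal.
    parents = defaultdict(set)
    for src, dst in edges:
        parents[dst].add(src)

    result = set()
    seen = {target}
    stack = [target]
    while stack:
        node = stack.pop()
        for src in parents.get(node, ()):
            result.add((src, node))   # keep the edge, not just the node
            if src not in seen:       # visit each ancestor once
                seen.add(src)
                stack.append(src)
    return result


# Hypothetical provenance graph: d3 was derived from d1 and d2, and so on.
EDGES = {
    ("d1", "d3"), ("d2", "d3"),
    ("d3", "d5"), ("d4", "d5"),
    ("d5", "d6"),
}

# First query: the lineage of d5, returned as a subgraph (a set of edges).
answer = lineage_edges(EDGES, "d5")
print(sorted(answer))

# Second query runs directly against the first answer.
print(sorted(lineage_edges(answer, "d3")))
```

Had the first query returned only the ancestor nodes of `d5`, the second query would have needed to go back to the full graph to recover how those nodes depend on one another.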