Efficient provenance storage over nested data collections

Authors:
Manish Kumar Anand;Shawn Bowers;Timothy McPhillips;Bertram Ludäscher
Affiliations:
University of California, Davis;University of California, Davis;University of California, Davis;University of California, Davis
Venue:
Proceedings of the 12th International Conference on Extending Database Technology: Advances in Database Technology
Year:
2009

Abstract

Scientific workflow systems are increasingly used to automate complex data analyses, largely due to their benefits over traditional approaches for workflow design, optimization, and provenance recording. Many workflow systems employ a simple dependency model to represent the provenance of data produced by workflow runs. Although commonly adopted, this model does not capture explicit data dependencies introduced by "provenance-aware" processes, and it can lead to inefficient storage when workflow data is complex or structured. We present a provenance model, extending the conventional approach, that supports (i) explicit data dependencies and (ii) nested data collections. Our model adopts techniques from reference-based XML versioning, adding annotations for process and data dependencies. We present strategies and reduction techniques to store immediate and transitive provenance information within our model, and examine trade-offs among update time, storage size, and query response time. We evaluate our approach on real-world and synthetic workflow execution traces, demonstrating significant reductions in storage size, while also reducing the time required to store and query provenance information.
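
The trade-off the abstract mentions between storing immediate and transitive provenance can be illustrated with a minimal sketch. This is not the authors' model: it omits nested collections, XML versioning, and the reduction techniques, and the names `ProvenanceStore`, `add_dependency`, `lineage`, and `stored_pairs` are hypothetical, chosen only for illustration. It assumes an acyclic "item depends on source" relation and contrasts keeping only immediate dependencies (small storage, traversal at query time) with materializing the transitive closure (more storage and update work, direct lookup at query time).

```python
from collections import defaultdict


class ProvenanceStore:
    """Toy dependency store (illustrative assumption, not the paper's model)."""

    def __init__(self, materialize=False):
        # materialize=True keeps the full transitive closure up to date.
        self.materialize = materialize
        self.immediate = defaultdict(set)   # item -> direct sources
        self.transitive = defaultdict(set)  # item -> all upstream sources

    def add_dependency(self, item, source):
        """Record that `item` was derived directly from `source`."""
        self.immediate[item].add(source)
        if self.materialize:
            # Everything upstream of `source` is now upstream of `item` ...
            new = {source} | self.transitive.get(source, set())
            # ... and of every item already known to depend on `item`.
            affected = [item] + [
                n for n, deps in self.transitive.items() if item in deps
            ]
            for node in affected:
                self.transitive[node] |= new

    def lineage(self, item):
        """Return all items that `item` depends on, directly or indirectly."""
        if self.materialize:
            return set(self.transitive.get(item, set()))  # direct lookup
        seen, stack = set(), [item]
        while stack:  # depth-first traversal over immediate dependencies
            for src in self.immediate.get(stack.pop(), ()):
                if src not in seen:
                    seen.add(src)
                    stack.append(src)
        return seen

    def stored_pairs(self):
        """Number of (item, source) pairs kept by the chosen strategy."""
        table = self.transitive if self.materialize else self.immediate
        return sum(len(sources) for sources in table.values())


# A four-step pipeline: a -> b -> c -> d
edges = [("b", "a"), ("c", "b"), ("d", "c")]
lazy, eager = ProvenanceStore(), ProvenanceStore(materialize=True)
for item, source in edges:
    lazy.add_dependency(item, source)
    eager.add_dependency(item, source)
```

Both stores answer `lineage("d")` with the same set, but the materialized store holds more pairs and does more work per update. That gap grows with the length of the dependency chains, which is the kind of storage, update, and query-time tension the abstract describes.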