A graph model of data and workflow provenance

  • Authors: Umut Acar; Peter Buneman; James Cheney; Jan Van Den Bussche; Natalia Kwasnikowska; Stijn Vansummeren

  • Affiliations (in author order): Max-Planck Institute for Software Systems; University of Edinburgh; University of Edinburgh; Hasselt University; Hasselt University; Université Libre de Bruxelles

  • Venue: TAPP'10 Proceedings of the 2nd conference on Theory and practice of provenance
  • Year: 2010

Abstract

Provenance has been studied extensively in both database and workflow management systems, so far with little convergence of definitions or models. Provenance in databases has generally been defined for relational or complex object data, by propagating fine-grained annotations or algebraic expressions from the input to the output. This kind of provenance has been found useful in other areas of computer science: annotation databases, probabilistic databases, schema and data integration, etc. In contrast, workflow provenance aims to capture a complete description of the evaluation (or enactment) of a workflow, and this is crucial to verification in scientific computation. Workflows and their provenance are often presented using graphical notation, making them easy to visualize but complicating the formal semantics that relates their run-time behavior to their provenance records. We bridge this gap by extending a previously developed dataflow language that supports both database-style querying and workflow-style batch processing steps, so that it produces a workflow-style provenance graph that can be explicitly queried. We define and describe the model through examples, present queries that extract other forms of provenance, and give an executable definition of the graph semantics of dataflow expressions.
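
To make the idea of a queryable provenance graph concrete, here is a minimal Python sketch. It is not the paper's dataflow language or its graph semantics. The `ProvGraph` class, the tuple-based expression format, and the `evaluate` and `input_sources` helpers are all hypothetical names chosen for illustration. The sketch assumes a toy expression language with named inputs, addition, and multiplication, records one graph node per evaluated value, and then answers a simple provenance query: which inputs contributed to a given result.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Set, Tuple


@dataclass
class ProvGraph:
    """Illustrative provenance graph: one node per value, edges from input to result."""
    labels: Dict[int, str] = field(default_factory=dict)
    values: Dict[int, Any] = field(default_factory=dict)
    edges: List[Tuple[int, int]] = field(default_factory=list)

    def add_node(self, label: str, value: Any, inputs: Tuple[int, ...] = ()) -> int:
        """Record a value and the nodes it was computed from; return its node id."""
        node = len(self.labels)
        self.labels[node] = label
        self.values[node] = value
        for src in inputs:
            self.edges.append((src, node))
        return node

    def ancestors(self, node: int) -> Set[int]:
        """Return every node from which `node` is reachable along edges."""
        seen: Set[int] = set()
        stack = [node]
        while stack:
            current = stack.pop()
            for src, dst in self.edges:
                if dst == current and src not in seen:
                    seen.add(src)
                    stack.append(src)
        return seen


def evaluate(expr: tuple, graph: ProvGraph) -> int:
    """Evaluate a toy expression and record each step in the graph (assumed format)."""
    kind = expr[0]
    if kind == "input":
        _, name, value = expr
        return graph.add_node(f"input:{name}", value)
    if kind in ("add", "mul"):
        left = evaluate(expr[1], graph)
        right = evaluate(expr[2], graph)
        a, b = graph.values[left], graph.values[right]
        result = a + b if kind == "add" else a * b
        return graph.add_node(kind, result, (left, right))
    raise ValueError(f"unknown expression kind: {kind!r}")


def input_sources(graph: ProvGraph, node: int) -> Set[str]:
    """Provenance query: labels of the input nodes that `node` depends on."""
    return {
        graph.labels[n]
        for n in graph.ancestors(node)
        if graph.labels[n].startswith("input:")
    }


# (x + y) * z with x = 2, y = 3, z = 4
graph = ProvGraph()
expr = ("mul", ("add", ("input", "x", 2), ("input", "y", 3)), ("input", "z", 4))
out = evaluate(expr, graph)
```

Here `graph.values[out]` holds the ordinary result of the computation, while `input_sources(graph, out)` is answered from the recorded graph alone. Keeping the graph as plain data is the design choice that lets other provenance queries be written as further traversals without re-running the computation.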