Incorporating provenance in database systems

  • Authors: Hosagrahar V. Jagadish; Adriane P. Chapman
  • Affiliations: University of Michigan; University of Michigan
  • Venue: Incorporating provenance in database systems
  • Year: 2008

Abstract

The importance of maintaining provenance has been widely recognized, particularly for highly manipulated data. Currently there are two approaches: provenance generated within workflow frameworks, and provenance within a contained relational database. Workflow provenance allows workflow re-execution and can offer some explanation of results. Within relational databases, knowledge of SQL queries and relational operators is used to express what happened to a tuple. There is a disconnect between these two areas of provenance research. Techniques that work in relational databases cannot be applied to workflow systems because of heterogeneous data types and black-box operators. Meanwhile, the real-life utility of workflow systems has not been extended to database provenance. In the gap between provenance in workflow systems and databases, there are myriad systems that need provenance. Examples include creating a new dataset, such as MiMI, from several sources and processes, and building an algorithm that generates sequence alignments, such as MiBlast. These hybrid systems cannot be forced into a workflow framework and do not exist solely within a database. This work solves issues that block provenance usage in hybrid systems. In particular, we look at capturing, storing, and using provenance information outside of workflow and database provenance systems. We tackle the problem of how to capture provenance for manual tasks. Database provenance and workflow systems provide no support for tracking the provenance of user actions, yet manual work is often a large component of the effort in these hybrid systems. We describe an approach to track and record the user's actions in a queryable form. Once provenance is captured, storage can become prohibitively expensive in both hybrid and workflow systems. We exploit properties of provenance information and identify several techniques to reduce the size of the provenance store.
Additionally, making provenance usable is a problem in workflow, database, and hybrid provenance systems. Provenance contains both too much and too little information. Provenance from the black boxes used in workflow and hybrid systems is impossible for a human to understand. We highlight the missing information that can assist user understanding, and develop a model of provenance answers to decrease information overload. Finally, workflow and database systems are designed to explain the results users see; they do not explain why items are absent from the result. We allow researchers to specify what they are looking for and answer why it does not exist in the result set.