Applying the virtual data provenance model

Authors:
Yong Zhao; Michael Wilde; Ian Foster
Affiliations:
University of Chicago; University of Chicago and Argonne National Laboratory; University of Chicago and Argonne National Laboratory
Venue:
IPAW'06 Proceedings of the 2006 international conference on Provenance and Annotation of Data
Year:
2006

Abstract

In many domains of science, engineering, and commerce, data analysis systems are employed to derive new data (and ultimately, one hopes, knowledge) from datasets describing experimental results or simulated phenomena. To support such analyses, we have developed a “virtual data system” that allows users first to define, then to invoke, and finally to explore the provenance of procedures (and workflows comprising multiple procedure calls) that perform such data derivations. The underlying execution model is “functional” in the sense that procedures read (but do not modify) their input and produce output via deterministic computations. This property makes it straightforward for the virtual data system to record not only the recipe for producing any given data object but also sufficient information about the environment in which the recipe was executed. Both are recorded with enough fidelity that the steps used to create a data object can be re-executed to reproduce it at a later time or in a different location. The virtual data system maintains this information in an integrated schema alongside semantic annotations. It thus enables a powerful query capability: the rich semantic information implied by knowledge of the structure of data derivation procedures can be exploited to provide an information environment that fuses recipe, history, and application-specific semantics. We provide here an overview of this integration, the queries and transformations that it enables, and examples of how these capabilities can serve scientific processes.
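
The define, invoke, and explore cycle described in the abstract can be illustrated with a minimal Python sketch. This is not the virtual data system's actual schema or interface: the `Catalog` class, its methods, the `Invocation` record, and the sample procedures `calibrate` and `average` are all hypothetical names chosen for illustration. The sketch assumes procedures are pure functions over JSON-serializable values, identifies data objects by a content digest, and records only a token amount of environment information.

```python
import hashlib
import json
import platform
import sys
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass(frozen=True)
class Invocation:
    """One recorded call: recipe name, input and output digests, environment."""
    procedure: str
    inputs: Tuple[str, ...]
    output: str
    environment: Tuple[Tuple[str, str], ...]


def digest(value) -> str:
    """Content digest of a JSON-serializable value (stands in for a dataset)."""
    payload = json.dumps(value, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()[:12]


class Catalog:
    """Hypothetical catalog holding recipes, history, and annotations together."""

    def __init__(self):
        self.procedures: Dict[str, Callable] = {}
        self.objects: Dict[str, object] = {}
        self.derived_by: Dict[str, Invocation] = {}
        self.annotations: Dict[str, Dict[str, str]] = {}

    def define(self, func):
        """Register a procedure (assumed deterministic and side-effect free)."""
        self.procedures[func.__name__] = func
        return func

    def put(self, value, **annotations) -> str:
        """Store a data object with optional semantic annotations."""
        key = digest(value)
        self.objects[key] = value
        self.annotations.setdefault(key, {}).update(annotations)
        return key

    def invoke(self, name, *input_keys, **annotations) -> str:
        """Run a procedure on stored objects and record how the output arose."""
        result = self.procedures[name](*(self.objects[k] for k in input_keys))
        key = digest(result)
        if key not in self.objects:
            # Record a derivation only for new objects, so lineage stays acyclic.
            env = (("python", platform.python_version()), ("platform", sys.platform))
            self.derived_by[key] = Invocation(name, tuple(input_keys), key, env)
        return self.put(result, **annotations)

    def lineage(self, key) -> List[Invocation]:
        """Invocations that led to `key`, ancestors first."""
        inv = self.derived_by.get(key)
        if inv is None:
            return []
        steps: List[Invocation] = []
        for parent in inv.inputs:
            steps.extend(s for s in self.lineage(parent) if s not in steps)
        steps.append(inv)
        return steps

    def reproduce(self, key):
        """Re-execute the recorded steps instead of reading the stored object."""
        inv = self.derived_by.get(key)
        if inv is None:
            return self.objects[key]
        return self.procedures[inv.procedure](*(self.reproduce(p) for p in inv.inputs))


catalog = Catalog()


@catalog.define
def calibrate(readings):
    return [round(r * 1.5, 2) for r in readings]


@catalog.define
def average(values):
    return sum(values) / len(values)


raw = catalog.put([1.0, 2.0, 3.0], kind="raw readings")
cal = catalog.invoke("calibrate", raw)
avg = catalog.invoke("average", cal, kind="summary")

# Explore provenance: which procedures produced the summary value?
steps = [s.procedure for s in catalog.lineage(avg)]

# A query combining annotations with derivation history.
calibrated_summaries = [
    k for k, a in catalog.annotations.items()
    if a.get("kind") == "summary"
    and any(s.procedure == "calibrate" for s in catalog.lineage(k))
]
```

Because the procedures are deterministic and leave their inputs untouched, `catalog.reproduce(avg)` recomputes the same value that was stored for `avg`, and the final query selects objects by annotation and derivation history together. A real system would also need to capture far more of the execution environment than the two fields recorded here.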