Thanks to the proliferation of computational techniques and the availability of datasets, data-intensive research has become commonplace in science. Sharing and re-use of datasets is key to scientific progress. A critical requirement for enabling data re-use is that data be accompanied by lineage metadata describing the context in which the data was produced, the source datasets from which it was derived, and the tooling or settings involved in its generation. By and large, this metadata is provided through a manual curation process, which is tedious, repetitive and time-consuming. In this paper, we explore the problem of curating data artifacts generated by scientific workflows, which have become an established method for organizing computational data analyses. Most workflow systems can be instrumented to gather provenance (i.e., lineage) information about the data artifacts generated by their execution. While this raw provenance provides detailed information on the localized lineage traced during a run, in the form of data derivation or activity causality relations, it is of little use when one needs to report on lineage in a broader scientific context. Consequently, datasets resulting from workflow-based analyses also require manual curation before they are published. We argue that, by making the analysis process explicit, workflow-based investigations provide an opportunity for semi-automating data curation. We introduce a novel approach that semi-automates curation through a special kind of workflow, which we call a Labeling Workflow. Using (1) the description of a scientific workflow, (2) a set of semantic annotations characterizing the data processing in workflows, and (3) a library of label handling functions, we devise a Labeling Workflow, which can be executed over raw provenance in order to curate the data artifacts it refers to.
We semi-formally describe the elements of our solution and showcase its usefulness using an example from biodiversity.
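To make the three ingredients more concrete, the following is a minimal sketch of how a Labeling Workflow might be run over raw provenance. It is an illustration under stated assumptions, not the construction described in the paper: the record type `Derivation`, the annotation vocabulary (`retrieval`, `filter`, `merge`), the label handling functions `mint_source` and `propagate_union`, and all step and artifact names are hypothetical, and the trace is assumed to be listed in execution order.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Derivation:
    """One raw-provenance record (hypothetical shape): a step used inputs and generated outputs."""
    step: str
    inputs: tuple
    outputs: tuple


def mint_source(step, input_labels):
    """Label handler (assumed): a retrieval step introduces a new 'source' label."""
    return {"source": {step}}


def propagate_union(step, input_labels):
    """Label handler (assumed): pass on the union of all labels found on the inputs."""
    merged = {}
    for labels in input_labels:
        for key, values in labels.items():
            merged.setdefault(key, set()).update(values)
    return merged


# (2) Semantic annotations: which kind of processing each workflow step performs.
ANNOTATIONS = {
    "fetch_source_a": "retrieval",
    "fetch_source_b": "retrieval",
    "clean_names": "filter",
    "merge_records": "merge",
}

# (3) Library of label handling functions, keyed by annotation.
LABEL_FUNCTIONS = {
    "retrieval": mint_source,
    "filter": propagate_union,
    "merge": propagate_union,
}


def run_labeling(trace, annotations, functions):
    """Walk the trace in order and attach labels to every generated artifact."""
    labels = {}
    for record in trace:
        handler = functions[annotations[record.step]]
        input_labels = [labels.get(artifact, {}) for artifact in record.inputs]
        produced = handler(record.step, input_labels)
        for artifact in record.outputs:
            # Copy the sets so artifacts do not share mutable label state.
            labels[artifact] = {key: set(vals) for key, vals in produced.items()}
    return labels


# Raw provenance of one run of a hypothetical biodiversity workflow.
TRACE = [
    Derivation("fetch_source_a", (), ("a_raw",)),
    Derivation("fetch_source_b", (), ("b_raw",)),
    Derivation("clean_names", ("a_raw",), ("a_clean",)),
    Derivation("merge_records", ("a_clean", "b_raw"), ("merged",)),
]

labels = run_labeling(TRACE, ANNOTATIONS, LABEL_FUNCTIONS)
print(labels["merged"])
```

In this sketch the final artifact `merged` ends up labeled with both retrieval steps as its sources, which is the kind of broader lineage statement that the raw derivation records do not state directly. Choosing the handler by annotation rather than by step name is what would let the same small library be reused across workflows.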