On assisting scientific data curation in collection-based dataflows using labels

Authors:
Pinar Alper;Carole A. Goble;Khalid Belhajjame
Affiliations:
University of Manchester, Manchester, UK;University of Manchester, Manchester, UK;Université Paris Dauphine, Paris, FR
Venue:
WORKS '13 Proceedings of the 8th Workshop on Workflows in Support of Large-Scale Science
Year:
2013



Abstract

Thanks to the proliferation of computational techniques and the availability of datasets, data-intensive research has become commonplace in science. Sharing and re-use of datasets are key to scientific progress. A critical requirement for enabling data re-use is for data to be accompanied by lineage metadata that describes the context in which the data was produced, the source datasets from which it was derived, and the tooling or settings involved in its generation. By and large, this metadata is provided through a manual curation process, which is tedious, repetitive and time-consuming. In this paper, we explore the problem of curating data artifacts generated from scientific workflows, which have become an established method for organizing computational data analyses. Most workflow systems can be instrumented to gather provenance (i.e. lineage) information about the data artifacts generated as a result of their execution. While this form of raw provenance provides elaborate information on localized lineage traced during a run, in the form of data derivation or activity causality relations, it is of little use when one needs to report on lineage in a broader scientific context. Consequently, datasets resulting from workflow-based analyses also require manual curation prior to publication. We argue that, by making the analysis process explicit, workflow-based investigations provide an opportunity for semi-automating data curation. We introduce a novel approach that semi-automates curation through a special kind of workflow, which we call a Labeling Workflow. Using 1) the description of a scientific workflow, 2) a set of semantic annotations characterizing the data processing in workflows, and 3) a library of label handling functions, we devise a Labeling Workflow, which can be executed over raw provenance in order to curate the data artifacts it refers to.
We semi-formally describe the elements of our solution and showcase its usefulness using an example from the biodiversity domain.
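
To make the three ingredients named in the abstract more concrete, the following is a minimal sketch of how label handling functions might be applied over a provenance trace. It is an illustration under stated assumptions, not the authors' implementation: the `Artifact` class, the `propagate` and `mint` handlers, the tuple-based trace format, and the step and label names are all hypothetical and do not come from the paper.

```python
from dataclasses import dataclass, field


@dataclass
class Artifact:
    """A data artifact recorded in a provenance trace (hypothetical structure)."""
    name: str
    labels: dict = field(default_factory=dict)


def propagate(inputs):
    """Hypothetical handler: carry the merged labels of all inputs to the output."""
    merged = {}
    for artifact in inputs:
        merged.update(artifact.labels)
    return merged


def mint(step_name):
    """Hypothetical handler factory: propagate input labels and add a new label
    recording which step generated the output."""
    def handler(inputs):
        labels = propagate(inputs)
        labels["generated_by"] = step_name
        return labels
    return handler


def run_labeling(trace, handlers, artifacts):
    """Walk the derivation records in trace order and label each output
    with the handler assigned to the step that produced it."""
    for step, input_names, output_name in trace:
        inputs = [artifacts[n] for n in input_names]
        artifacts[output_name].labels = handlers[step](inputs)
    return artifacts


def demo():
    # Assumed toy trace: a name-cleaning step followed by an occurrence lookup.
    artifacts = {n: Artifact(n) for n in ("raw_names", "clean_names", "occurrences")}
    artifacts["raw_names"].labels = {"source": "checklist_A"}
    trace = [
        ("clean", ["raw_names"], "clean_names"),
        ("lookup", ["clean_names"], "occurrences"),
    ]
    # Assumed mapping from workflow step to handler; in the abstract's terms this
    # choice would be driven by the semantic annotations on each step.
    handlers = {"clean": propagate, "lookup": mint("lookup")}
    return run_labeling(trace, handlers, artifacts)
```

In this sketch the mapping from steps to handlers stands in for the semantic annotations, and the loop in `run_labeling` stands in for the Labeling Workflow executed over raw provenance. The real approach may differ in how labels are represented and how handlers are selected.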