On assisting scientific data curation in collection-based dataflows using labels

  • Authors:
  • Pinar Alper;Carole A. Goble;Khalid Belhajjame

  • Affiliations:
  • University of Manchester, Manchester, UK;University of Manchester, Manchester, UK;Université Paris Dauphine, Paris, FR

  • Venue:
  • WORKS '13 Proceedings of the 8th Workshop on Workflows in Support of Large-Scale Science
  • Year:
  • 2013

Quantified Score

Hi-index 0.00

Visualization

Abstract

Thanks to the proliferation of computational techniques and the availability of datasets, data-intensive research has become commonplace in science. Sharing and re-use of datasets is key to scientific progress. A critical requirement for enabling data re-use, is for data to be accompanied by lineage metadata that describes the context in which data is produced, the source datasets from which it was derived and the tooling or settings involved in its generation. By and large, this metadata is provided through a manual curation process, which is tedious, repetitive and time consuming. In this paper, we explore the problem of curating data artifacts generated from scientific workflows, which have become an established method for organizing computational data analyses. Most workflow systems can be instrumented to gather provenance, i.e. lineage, information about the data artifacts generated as a result of their execution. While this form of raw provenance provides elaborate information on localized lineage traced during a run in the form of data derivation or activity causality relations, it is of little use when one needs to report on lineage in a broader scientific context. And, consequently, datasets resulting from workflow-based analyses also require manual curation prior to their publishing. We argue that by making the analysis process explicit, workflow-based investigations provide an opportunity for semi-automating data curation. In this paper we introduce a novel approach that semi-automates curation through a special kind of workflow, which we call a Labeling Workflow. Using 1) the description of a scientific workflow, 2) a set of semantic annotations characterizing the data processing in workflows, and, 3) a library of label handling functions, we devise a Labeling Workflow, which can be executed over raw provenance in order to curate the data artifacts it refers to. We semi-formally describe the elements of our solution, and showcase its usefulness using an example from Biodiversity.