Provenance framework in support of data quality estimation

  • Authors:
  • Beth Plale;Dennis Gannon;Yogesh L. Simmhan

  • Affiliations:
  • Indiana University;Indiana University;Indiana University

  • Venue:
  • Provenance framework in support of data quality estimation
  • Year:
  • 2007

Abstract

Over the past several decades, science has evolved from a purely empirical and theoretical approach to one that also includes computational simulation and modeling, commonly known as e-Science. Advances in cyberinfrastructure for e-Science have enabled researchers to run complex computational investigations in which data access, analysis, and model runs execute, largely automatically, as data-driven workflows. Provenance is metadata that describes the process by which workflows generate datasets. This data derivation history is essential for understanding how a datum was created, verifying and validating experimental results, and determining the quality of the derived data. This dissertation makes two key contributions to scientific data management. First, it proposes a low-overhead provenance collection framework for scientific workflows. The Karma Provenance Framework is a prototype implementation that collects provenance activities from automatically instrumented services and builds a data provenance model from runtime information. Karma provides a service interface for querying different forms of provenance. The framework has been applied in the LEAD cyberinfrastructure, and its performance has been validated through empirical analysis. Second, it defines a data quality model for estimating the subjective quality of derived data for scientific applications. The model uses a holistic set of quality metrics, including provenance, intrinsic metadata, quality of service, and community perception, to estimate a numerical quality score for the data. This enables a scientist to select the best-quality dataset for their application from among the many that qualify. Experimental studies conducted on a prototype quality broker validate the feasibility and prediction accuracy of the model.
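
The idea of combining several quality metrics into one numerical score and then choosing the best dataset can be illustrated with a minimal sketch. The abstract does not state how the model aggregates its metrics, so the weighted average below is an assumption made purely for illustration, not the dissertation's method. The names `Candidate`, `quality_score`, and `best_candidate`, the metric keys, and all weights and scores are hypothetical. The per-user weights are one possible way to express that quality is subjective.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    """A dataset with per-metric scores, each assumed normalized to [0, 1]."""
    name: str
    metrics: dict


def quality_score(metrics, weights):
    """Weighted average of metric scores (an assumed aggregation rule).

    Metrics without a score count as 0.0; weights are normalized by their sum.
    """
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    return sum(w * metrics.get(key, 0.0) for key, w in weights.items()) / total


def best_candidate(candidates, weights):
    """Return the candidate with the highest score under these weights."""
    return max(candidates, key=lambda c: quality_score(c.metrics, weights))


# Hypothetical preferences of one scientist: provenance matters most.
weights = {
    "provenance": 0.4,
    "intrinsic_metadata": 0.2,
    "quality_of_service": 0.2,
    "community_perception": 0.2,
}

# Hypothetical candidate datasets with made-up scores.
candidates = [
    Candidate("dataset_a", {
        "provenance": 0.9,
        "intrinsic_metadata": 0.5,
        "quality_of_service": 0.7,
        "community_perception": 0.6,
    }),
    Candidate("dataset_b", {
        "provenance": 0.4,
        "intrinsic_metadata": 0.9,
        "quality_of_service": 0.8,
        "community_perception": 0.9,
    }),
]

if __name__ == "__main__":
    for c in candidates:
        print(c.name, round(quality_score(c.metrics, weights), 3))
    print("selected:", best_candidate(candidates, weights).name)
```

Under these assumed weights, a different scientist who valued community perception over provenance could simply supply a different `weights` mapping and might obtain a different ranking of the same candidates.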