MTCProv: a practical provenance query framework for many-task scientific computing

  • Authors:
  • Luiz M. Gadelha, Jr.; Michael Wilde; Marta Mattoso; Ian Foster

  • Affiliations:
  • Computer Engineering Program, COPPE, Federal University of Rio de Janeiro, Rio de Janeiro, Brazil and National Laboratory for Scientific Computing, Petrópolis, Brazil; Mathematics and Computer Science Division, Argonne National Laboratory, Chicago, USA and Computation Institute, Argonne National Laboratory and University of Chicago, Chicago, USA; Computer Engineering Program, COPPE, Federal University of Rio de Janeiro, Rio de Janeiro, Brazil; Mathematics and Computer Science Division, Argonne National Laboratory, Chicago, USA and Computation Institute, Argonne National Laboratory and University of Chicago, Chicago, USA and Department o ...

  • Venue:
  • Distributed and Parallel Databases
  • Year:
  • 2012

Abstract

Scientific research is increasingly assisted by computer-based experiments. Such experiments are often composed of a vast number of loosely coupled computational tasks that are specified and automated as scientific workflows. This large scale is also characteristic of the data that flows within such "many-task" computations (MTC). Provenance information can record the behavior of such computational experiments via the lineage of process and data artifacts. However, work to date has focused on lineage data models, leaving unsolved the issues of recording and querying other aspects, such as domain-specific information about the experiments, MTC behavior given by resource consumption and failure information, or the impact of the environment on performance and accuracy. In this work we contribute MTCProv, a provenance query framework for many-task scientific computing that captures the runtime execution details of MTC workflow tasks on parallel and distributed systems, in addition to standard prospective and data derivation provenance. To help users query provenance data, we provide a high-level interface that hides relational query complexities. We evaluate MTCProv using an application in protein science, and describe how important query patterns, such as correlations between provenance, runtime data, and scientific parameters, are simplified and expressed.
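
As an illustration of the kind of relational query complexity such an interface might hide, the sketch below stores a toy provenance graph in SQLite and wraps a recursive lineage query in a single function that also returns per-task runtime. The schema (task, consumes, produces), the sample data, and the function names build_demo_db and ancestors are all hypothetical and chosen for this example; they are not MTCProv's actual schema or query interface.

```python
import sqlite3


def build_demo_db():
    """Create an in-memory database with a hypothetical provenance schema."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE task (
            id TEXT PRIMARY KEY,
            app TEXT,            -- name of the program the task ran
            duration_s REAL,     -- runtime detail recorded with the task
            exit_code INTEGER    -- failure information
        );
        CREATE TABLE consumes (task_id TEXT, dataset TEXT);
        CREATE TABLE produces (task_id TEXT, dataset TEXT);
    """)
    # Made-up three-step pipeline: raw.dat -> prepared.dat -> sim.out -> report.txt
    conn.executemany("INSERT INTO task VALUES (?, ?, ?, ?)", [
        ("t1", "prepare", 2.0, 0),
        ("t2", "simulate", 40.0, 0),
        ("t3", "analyze", 5.0, 0),
    ])
    conn.executemany("INSERT INTO consumes VALUES (?, ?)", [
        ("t1", "raw.dat"), ("t2", "prepared.dat"), ("t3", "sim.out"),
    ])
    conn.executemany("INSERT INTO produces VALUES (?, ?)", [
        ("t1", "prepared.dat"), ("t2", "sim.out"), ("t3", "report.txt"),
    ])
    return conn


def ancestors(conn, dataset):
    """Return (id, app, duration_s) for every task that contributed to
    `dataset`, directly or transitively, ordered by task id."""
    return conn.execute("""
        WITH RECURSIVE upstream(task_id) AS (
            -- base case: the task that produced the dataset itself
            SELECT task_id FROM produces WHERE dataset = ?
            UNION
            -- recursive step: tasks that produced any input of an upstream task
            SELECT p.task_id
            FROM upstream u
            JOIN consumes c ON c.task_id = u.task_id
            JOIN produces p ON p.dataset = c.dataset
        )
        SELECT t.id, t.app, t.duration_s
        FROM upstream u JOIN task t ON t.id = u.task_id
        ORDER BY t.id
    """, (dataset,)).fetchall()


if __name__ == "__main__":
    conn = build_demo_db()
    for task_id, app, duration in ancestors(conn, "report.txt"):
        print(task_id, app, duration)
```

A caller only supplies a dataset name. The recursive join over the derivation tables and the correlation with runtime columns stay inside the helper, which is the general idea behind hiding relational details from the user.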