Capturing and querying workflow runtime provenance with PROV: a practical approach

Authors:
Flavio Costa;Vítor Silva;Daniel de Oliveira;Kary Ocaña;Eduardo Ogasawara;Jonas Dias;Marta Mattoso
Affiliations:
COPPE/Federal University of Rio de Janeiro, Brazil;COPPE/Federal University of Rio de Janeiro, Brazil;COPPE/Federal University of Rio de Janeiro, Brazil;COPPE/Federal University of Rio de Janeiro, Brazil;COPPE/Federal University of Rio de Janeiro, Brazil and CEFET-RJ, Brazil;COPPE/Federal University of Rio de Janeiro, Brazil;COPPE/Federal University of Rio de Janeiro, Brazil
Venue:
Proceedings of the Joint EDBT/ICDT 2013 Workshops
Year:
2013

Abstract

Scientific workflows are commonly used to model and execute large-scale scientific experiments. They are key resources for scientists and are enacted and managed by Scientific Workflow Management Systems (SWfMS). Each SWfMS has its own approach to executing workflows and to capturing and managing their provenance data. Because experiments are large in scale, it may be impractical to analyze provenance data only after the execution ends: a single experiment may take weeks to run, even in high-performance computing environments. Scientists therefore need to monitor the experiment during its execution, and this can be done through provenance data. Runtime provenance analysis allows scientists to monitor workflow execution and to take action before it ends (i.e., workflow steering). This provenance data can also be used to fine-tune the parallel execution of the workflow dynamically. We use the PROV data model as a basic framework for modeling and providing runtime provenance as a database that can be queried even during execution. This database is agnostic of the SWfMS and the workflow engine. We show the benefits of representing and sharing runtime provenance data for improving experiment management as well as the analysis of scientific data.
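
To make the idea of a provenance database that can be queried during execution more concrete, the following is a minimal sketch in Python using the standard sqlite3 module. It is not the schema or implementation used by the authors. It only borrows three PROV concepts (Activity, Entity, and the used and wasGeneratedBy relations). The table layout, the function names, the status values, and the file names in the demo are all assumptions made for illustration.

```python
import sqlite3

# Hypothetical, simplified PROV-style schema (not the paper's schema).
SCHEMA = """
CREATE TABLE activity (
    id         TEXT PRIMARY KEY,
    name       TEXT NOT NULL,
    started_at REAL NOT NULL,
    ended_at   REAL,             -- NULL while the activity is still running
    status     TEXT NOT NULL     -- assumed values: RUNNING, FINISHED, FAILED
);
CREATE TABLE entity (
    id TEXT PRIMARY KEY
);
CREATE TABLE used (
    activity_id TEXT NOT NULL REFERENCES activity(id),
    entity_id   TEXT NOT NULL REFERENCES entity(id)
);
CREATE TABLE was_generated_by (
    entity_id   TEXT NOT NULL REFERENCES entity(id),
    activity_id TEXT NOT NULL REFERENCES activity(id)
);
"""


def open_store():
    """Create an in-memory provenance store with the sketch schema."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    return conn


def start_activity(conn, activity_id, name, started_at, inputs):
    """Record that an activity started and which entities it used."""
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "INSERT INTO activity VALUES (?, ?, ?, NULL, 'RUNNING')",
            (activity_id, name, started_at),
        )
        for entity_id in inputs:
            conn.execute("INSERT OR IGNORE INTO entity VALUES (?)", (entity_id,))
            conn.execute("INSERT INTO used VALUES (?, ?)", (activity_id, entity_id))


def finish_activity(conn, activity_id, ended_at, status, outputs):
    """Record that an activity ended and which entities it generated."""
    with conn:
        conn.execute(
            "UPDATE activity SET ended_at = ?, status = ? WHERE id = ?",
            (ended_at, status, activity_id),
        )
        for entity_id in outputs:
            conn.execute("INSERT OR IGNORE INTO entity VALUES (?)", (entity_id,))
            conn.execute(
                "INSERT INTO was_generated_by VALUES (?, ?)",
                (entity_id, activity_id),
            )


def running_activities(conn):
    """Monitoring query: ids of activities that have not ended yet."""
    rows = conn.execute(
        "SELECT id FROM activity WHERE status = 'RUNNING' ORDER BY started_at"
    )
    return [row[0] for row in rows]


def inputs_behind(conn, entity_id):
    """Lineage query: entities used by the activity that generated entity_id."""
    rows = conn.execute(
        "SELECT u.entity_id FROM was_generated_by g "
        "JOIN used u ON u.activity_id = g.activity_id "
        "WHERE g.entity_id = ? ORDER BY u.entity_id",
        (entity_id,),
    )
    return [row[0] for row in rows]


if __name__ == "__main__":
    store = open_store()
    # Hypothetical two-step workflow; the second step is still running.
    start_activity(store, "a1", "align", 0.0, ["genome.fasta"])
    finish_activity(store, "a1", 5.0, "FINISHED", ["alignment.aln"])
    start_activity(store, "a2", "build_tree", 6.0, ["alignment.aln"])
    print(running_activities(store))
    print(inputs_behind(store, "alignment.aln"))
```

Because each start and finish event is written as soon as it happens, the monitoring and lineage queries can be issued while later activities are still running, which is the property the abstract relies on for workflow steering.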