Provenance for MapReduce-based data-intensive workflows

  • Authors:
  • Daniel Crawl, Jianwu Wang, Ilkay Altintas

  • Affiliations:
  • University of California, San Diego, San Diego, CA, USA (all authors)

  • Venue:
  • Proceedings of the 6th Workshop on Workflows in Support of Large-Scale Science
  • Year:
  • 2011


Abstract

MapReduce has been widely adopted by many business and scientific applications for data-intensive processing of large datasets. There are increasing efforts to make workflows and workflow systems work with the MapReduce programming model and the Hadoop environment, including our earlier work on a higher-level programming model for MapReduce within the Kepler Scientific Workflow System. However, to date, the provenance of MapReduce-based workflows and its effects on workflow execution performance have not been studied in depth. In this paper, we present an extension to our earlier work on MapReduce in Kepler to record the provenance of MapReduce workflows created using the Kepler+Hadoop framework. In particular, we present: (i) a data model that can capture provenance inside a MapReduce job as well as the provenance of the workflow that submitted it; (ii) an extension to the Kepler+Hadoop architecture to record provenance using this data model on MySQL Cluster; (iii) a programming interface to query the collected information; and (iv) an evaluation of the scalability of collecting and querying this provenance information using two scenarios with different characteristics.
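The core idea the abstract describes — recording data dependencies inside the map and reduce phases so that lineage can later be queried — can be illustrated with a small sketch. This is not the paper's actual data model or API; it is a hypothetical, simplified simulation of a word-count MapReduce job in which each map output is tagged with the input record and task that produced it, and a query function traces a final result back to its contributing inputs.

```python
from collections import defaultdict

# Hypothetical lineage store: maps an output record to the inputs that
# contributed to it, loosely analogous to the per-task dependency records
# a MapReduce provenance data model would capture.
provenance = defaultdict(list)

def map_task(task_id, line):
    """Map phase of word count; logs each emitted pair's source record."""
    for word in line.split():
        provenance[("map", word)].append((task_id, line))
        yield word, 1

def reduce_task(word, counts):
    """Reduce phase; produces the final count for a word."""
    return word, sum(counts)

def run_job(lines):
    """Simulate map -> shuffle -> reduce over in-memory input lines."""
    shuffled = defaultdict(list)
    for i, line in enumerate(lines):
        for word, one in map_task(f"m{i}", line):
            shuffled[word].append(one)
    return dict(reduce_task(w, c) for w, c in shuffled.items())

def query_lineage(word):
    """Return the input lines that contributed to a word's final count."""
    return [line for _, line in provenance[("map", word)]]

result = run_job(["a b a", "b c"])
# result == {"a": 2, "b": 2, "c": 1}
# query_lineage("b") == ["a b a", "b c"]
```

In a real Hadoop deployment the lineage records would be written from within the tasks to an external store (the paper uses MySQL Cluster) rather than kept in process memory, and the query interface would run against that store.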