MapReduce has been widely adopted in business and scientific applications for data-intensive processing of large datasets. A growing number of workflow systems are being adapted to work with the MapReduce programming model and the Hadoop environment, including our own work on a higher-level programming model for MapReduce within the Kepler Scientific Workflow System. To date, however, neither the provenance of MapReduce-based workflows nor its effect on workflow execution performance has been studied in depth. In this paper, we extend our earlier work on MapReduce in Kepler to record the provenance of MapReduce workflows created with the Kepler+Hadoop framework. In particular, we present: (i) a data model that captures provenance inside a MapReduce job as well as provenance for the workflow that submitted it; (ii) an extension to the Kepler+Hadoop architecture that records provenance with this data model on MySQL Cluster; (iii) a programming interface for querying the collected information; and (iv) an evaluation of the scalability of collecting and querying this provenance information, using two scenarios with different characteristics.
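To make contributions (i) to (iii) concrete, here is a minimal sketch of what a two-level provenance store and a lineage query could look like. It is an illustration under stated assumptions, not the paper's data model or API: the table names, column names, and the `record_task` and `lineage` helpers are all hypothetical, and Python's built-in `sqlite3` stands in for MySQL Cluster so the example runs on its own.

```python
import sqlite3

# Hypothetical schema: every table and column name here is illustrative only.
# workflow_run holds workflow-level provenance; mr_task and dependency hold
# provenance from inside a MapReduce job, linked back to the submitting run.
SCHEMA = """
CREATE TABLE workflow_run (run_id INTEGER PRIMARY KEY, workflow_name TEXT);
CREATE TABLE mr_task (
    task_id TEXT PRIMARY KEY,
    run_id  INTEGER REFERENCES workflow_run(run_id),
    kind    TEXT CHECK (kind IN ('map', 'reduce'))
);
CREATE TABLE dependency (
    task_id   TEXT REFERENCES mr_task(task_id),
    input_id  TEXT,
    output_id TEXT
);
"""


def open_store():
    """Create an in-memory store (sqlite3 is a stand-in for MySQL Cluster)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    return conn


def record_task(conn, run_id, task_id, kind, pairs):
    """Record one map or reduce task and its (input, output) data dependencies."""
    conn.execute("INSERT INTO mr_task VALUES (?, ?, ?)", (task_id, run_id, kind))
    conn.executemany(
        "INSERT INTO dependency VALUES (?, ?, ?)",
        [(task_id, inp, out) for inp, out in pairs],
    )


def lineage(conn, output_id):
    """Return every data item that output_id transitively depends on."""
    seen, frontier = set(), [output_id]
    while frontier:
        item = frontier.pop()
        rows = conn.execute(
            "SELECT input_id FROM dependency WHERE output_id = ?", (item,)
        ).fetchall()
        for (inp,) in rows:
            if inp not in seen:
                seen.add(inp)
                frontier.append(inp)
    return seen


def demo():
    """Record a toy word-count run and trace one reduce output to its inputs."""
    conn = open_store()
    conn.execute("INSERT INTO workflow_run VALUES (1, 'wordcount')")
    record_task(conn, 1, "map_0", "map", [("line1", "foo@map_0")])
    record_task(conn, 1, "map_1", "map", [("line2", "foo@map_1")])
    record_task(
        conn, 1, "reduce_0", "reduce",
        [("foo@map_0", "foo:2"), ("foo@map_1", "foo:2")],
    )
    return lineage(conn, "foo:2")
```

Calling `demo()` traces the reduce output `foo:2` back through both map outputs to the two original input lines. Storing one row per dependency keeps recording simple but makes the store grow with the number of key-value pairs processed, which is the kind of cost the scalability evaluation in (iv) is concerned with.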