HadoopProv: towards provenance as a first class citizen in MapReduce

Authors:
Sherif Akoush;Ripduman Sohan;Andy Hopper
Affiliations:
University of Cambridge;University of Cambridge;University of Cambridge
Venue:
Proceedings of the 5th USENIX Workshop on the Theory and Practice of Provenance
Year:
2013

Abstract

We introduce HadoopProv, a modified version of Hadoop that implements provenance capture and analysis in MapReduce jobs. It is designed to minimise provenance capture overheads by (i) treating provenance tracking in Map and Reduce phases separately, and (ii) deferring construction of the provenance graph to the query stage. At query time, the provenance graph is built by joining the Map and Reduce provenance files on matching intermediate keys. In our prototype implementation, HadoopProv has an overhead below 10% on typical job runtime. Provenance queries are serviceable in O(k log n), where n is the number of records per Map task and k is the number of Map tasks in which the key appears.
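
The deferred join described in the abstract can be illustrated with a minimal Python sketch. It is not HadoopProv's implementation, and the abstract does not specify the on-disk format. The sketch assumes a hypothetical layout: each Map task writes `(intermediate_key, input_record_id)` pairs sorted by key, each Reduce task records which intermediate key produced each output record, and a hypothetical `key_index` lists the Map tasks that emitted each key. The names `lookup`, `backward_query`, `map_files`, `reduce_file`, and `key_index` are all illustrative. Under these assumptions, tracing one output record costs one binary search per relevant Map task, which is consistent with the k log n term in the abstract.

```python
from bisect import bisect_left


def lookup(map_file, key):
    """Find all input record ids for `key` in one sorted Map provenance file.

    `map_file` is a list of (intermediate_key, input_record_id) tuples sorted
    by key (assumed layout). The binary search takes O(log n) steps.
    """
    # (key,) sorts before every (key, record_id), so this lands on the
    # first entry whose intermediate key is >= `key`.
    i = bisect_left(map_file, (key,))
    inputs = []
    while i < len(map_file) and map_file[i][0] == key:
        inputs.append(map_file[i][1])
        i += 1
    return inputs


def backward_query(output_id, reduce_file, map_files, key_index):
    """Trace one output record back to the input records it depends on.

    The join happens here, at query time: the Reduce provenance gives the
    intermediate key, and only the k Map tasks that emitted that key are
    searched.
    """
    key = reduce_file[output_id]
    return {task: lookup(map_files[task], key)
            for task in key_index.get(key, ())}


# Hypothetical provenance for a two-Map-task job.
map_files = {
    "m0": [("apple", "in0:3"), ("apple", "in0:7"), ("pear", "in0:1")],
    "m1": [("fig", "in1:2"), ("pear", "in1:5")],
}
reduce_file = {"out:0": "apple", "out:1": "pear"}
key_index = {"apple": ["m0"], "fig": ["m1"], "pear": ["m0", "m1"]}

print(backward_query("out:1", reduce_file, map_files, key_index))
```

No graph is materialised while the job runs in this sketch. The Map and Reduce records are written independently and are connected only when a query asks for a specific output record.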