HadoopProv: towards provenance as a first class citizen in MapReduce

  • Authors:
  • Sherif Akoush;Ripduman Sohan;Andy Hopper

  • Affiliations:
  • University of Cambridge;University of Cambridge;University of Cambridge

  • Venue:
  • Proceedings of the 5th USENIX Workshop on the Theory and Practice of Provenance
  • Year:
  • 2013

Abstract

We introduce HadoopProv, a modified version of Hadoop that implements provenance capture and analysis in MapReduce jobs. It is designed to minimise provenance capture overheads by (i) treating provenance tracking in the Map and Reduce phases separately, and (ii) deferring construction of the provenance graph to the query stage. At query time, the provenance graph is built by joining the Map and Reduce provenance files on matching intermediate keys. In our prototype implementation, HadoopProv has an overhead below 10% on typical job runtime. Provenance queries are serviceable in O(k log n), where n is the number of records per Map task and k is the number of Map tasks in which the key appears.
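
The query-time join described in the abstract can be illustrated with a minimal sketch. This is not HadoopProv's file format or API: the in-memory structures `map_prov` and `reduce_prov`, the record identifiers, and the functions `lookup` and `trace_back` are all assumptions made for illustration. The sketch assumes each Map task's provenance entries are sorted by intermediate key, so that one key can be located with a binary search in each of the k Map tasks that emitted it, which is where a cost of roughly k log n would come from.

```python
from bisect import bisect_left

# Hypothetical Map-side provenance: for each Map task, a list of
# (intermediate_key, input_record_id) pairs sorted by key.
map_prov = {
    "map_0": [("apple", "split0:0"), ("pear", "split0:17")],
    "map_1": [("apple", "split1:4"), ("fig", "split1:9")],
}

# Hypothetical Reduce-side provenance: each output record maps to the
# intermediate key it was derived from and the Map tasks that emitted that key.
reduce_prov = {
    "out:apple": ("apple", ["map_0", "map_1"]),
    "out:fig": ("fig", ["map_1"]),
}


def lookup(sorted_pairs, key):
    """Binary-search one Map task's sorted entries for every input under `key`."""
    # (key,) sorts before any (key, record_id), so this finds the first match.
    i = bisect_left(sorted_pairs, (key,))
    inputs = []
    while i < len(sorted_pairs) and sorted_pairs[i][0] == key:
        inputs.append(sorted_pairs[i][1])
        i += 1
    return inputs


def trace_back(output_id):
    """Join Reduce and Map provenance on the intermediate key, at query time."""
    key, tasks = reduce_prov[output_id]
    return {task: lookup(map_prov[task], key) for task in tasks}


if __name__ == "__main__":
    print(trace_back("out:apple"))
```

Nothing is joined while the job runs in this sketch: the two tables are written independently and only combined when `trace_back` is called, mirroring the deferred graph construction the abstract describes.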