Pilot-MapReduce: an extensible and flexible MapReduce implementation for distributed data

  • Authors:
  • Pradeep Kumar Mantha; Andre Luckow; Shantenu Jha

  • Affiliations:
  • Louisiana State University, Baton Rouge, LA, USA; Louisiana State University, Baton Rouge, LA, USA; Rutgers University, Piscataway, NJ, USA

  • Venue:
  • Proceedings of the Third International Workshop on MapReduce and its Applications
  • Year:
  • 2012

Abstract

The volume and complexity of data that must be analyzed in scientific applications are increasing exponentially. This data is often distributed, so efficient processing of large distributed datasets is required, ideally without introducing fundamentally new programming models or methods. For example, it is desirable to extend MapReduce -- a proven and effective programming model for processing large datasets -- to work more effectively on distributed data and on different infrastructures. MapReduce on distributed data requires effective distributed coordination of computation (map and reduce) and data, as well as distributed data management (in particular, the transfer of intermediate data). We posit that this can be achieved with an effective and efficient runtime environment, without refactoring MapReduce itself. To address these requirements, we design and implement Pilot-MapReduce (PMR) -- a flexible, infrastructure-independent runtime environment for MapReduce. PMR is based on Pilot abstractions for both compute (Pilot-Jobs) and data (Pilot-Data): it uses Pilot-Jobs to couple the map phase computation to the nearby source data, and Pilot-Data to move intermediate data to the reduce phase using parallel data transfers. We analyze the effectiveness of PMR on applications with different characteristics (e.g., different volumes of intermediate and output data). We investigate the performance of PMR with distributed data using a Word Count and a genome sequencing application over different MapReduce configurations. Our experimental evaluations show that the Pilot abstractions are powerful abstractions for distributed data: PMR can lower the execution time on distributed clusters, and it provides the desired flexibility in the deployment and configuration of MapReduce runs to address specific application characteristics.
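
The data flow described in the abstract (map next to the source data, parallel movement of intermediate data, then reduce) can be illustrated with a small Word Count sketch. This is not the PMR implementation and does not use the Pilot-Job or Pilot-Data APIs: the "sites" are plain in-memory lists, the transfer step is simulated with a thread pool, and every name here (`sites`, `map_on_site`, `transfer`, `NUM_REDUCERS`) is a hypothetical choice made for illustration.

```python
from collections import Counter, defaultdict
from concurrent.futures import ThreadPoolExecutor
import zlib

NUM_REDUCERS = 2  # hypothetical number of reduce partitions


def partition(word: str) -> int:
    # Stable hash so a given word always lands in the same reduce partition.
    return zlib.crc32(word.encode("utf-8")) % NUM_REDUCERS


def map_on_site(chunks):
    """Map phase for one site: count words locally, split by reduce partition."""
    parts = [Counter() for _ in range(NUM_REDUCERS)]
    for chunk in chunks:
        for word in chunk.lower().split():
            parts[partition(word)][word] += 1
    return parts


def transfer(item):
    """Stand-in (assumption) for moving one intermediate partition to its reducer."""
    reducer_id, counts = item
    return reducer_id, dict(counts)  # the copy simulates the data movement


def word_count(sites):
    # 1. Run the map phase once per site, next to that site's data.
    with ThreadPoolExecutor() as pool:
        mapped = list(pool.map(map_on_site, sites.values()))
    # 2. Move all intermediate partitions to the reducers in parallel.
    items = [(r, parts[r]) for parts in mapped for r in range(NUM_REDUCERS)]
    with ThreadPoolExecutor() as pool:
        moved = list(pool.map(transfer, items))
    # 3. Reduce: merge the partial counts that arrived at each reducer.
    inbox = defaultdict(Counter)
    for reducer_id, counts in moved:
        inbox[reducer_id].update(counts)
    result = Counter()
    for counts in inbox.values():
        result.update(counts)
    return dict(result)


# Two hypothetical sites, each holding its own text chunks.
sites = {
    "site_a": ["pilot jobs run map tasks", "map tasks read local data"],
    "site_b": ["pilot data moves intermediate data"],
}
counts = word_count(sites)
```

In this sketch only the per-partition counts cross the simulated site boundary, which mirrors the abstract's point that the volume of intermediate data is the quantity that matters when choosing a configuration.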