Pilot-MapReduce: an extensible and flexible MapReduce implementation for distributed data

Authors:
Pradeep Kumar Mantha;Andre Luckow;Shantenu Jha
Affiliations:
Louisiana State University, Baton Rouge, LA, USA;Louisiana State University, Baton Rouge, LA, USA;Rutgers University, Piscataway, NJ, USA
Venue:
Proceedings of the third international workshop on MapReduce and its Applications
Year:
2012

Abstract

The volume and complexity of data that must be analyzed in scientific applications are increasing exponentially. This data is often distributed, so large distributed datasets must be processed efficiently, ideally without introducing fundamentally new programming models or methods. For example, it is desirable to extend MapReduce, a proven and effective programming model for processing large datasets, to work more effectively on distributed data and on different infrastructures. MapReduce on distributed data requires effective distributed coordination of computation (map and reduce) and data, as well as distributed data management (in particular, the transfer of intermediate data). We posit that this can be achieved with an effective and efficient runtime environment, without refactoring MapReduce itself. To address these requirements, we design and implement Pilot-MapReduce (PMR), a flexible, infrastructure-independent runtime environment for MapReduce. PMR is based on Pilot abstractions for both compute (Pilot-Jobs) and data (Pilot-Data): it uses Pilot-Jobs to couple the map-phase computation to nearby source data, and Pilot-Data to move intermediate data to the reduce phase using parallel data transfers. We analyze the effectiveness of PMR on applications with different characteristics (e.g., different volumes of intermediate and output data). We investigate the performance of PMR with distributed data using a Word Count and a genome sequencing application over different MapReduce configurations. Our experimental evaluations show that the Pilot abstractions are powerful abstractions for distributed data: PMR can lower the execution time on distributed clusters, and it provides the desired flexibility in the deployment and configuration of MapReduce runs to address specific application characteristics.
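
The data flow described in the abstract (map tasks run next to their source data, then intermediate data is moved in parallel to the reduce phase) can be illustrated with a small Word Count sketch. This is not the PMR implementation and does not use the Pilot-Job or Pilot-Data APIs: the site names, the `SITE_DATA` contents, and every function below are hypothetical, and a local thread pool stands in for remote workers and transfers.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
import zlib

# Hypothetical input: text chunks that already reside on different sites.
SITE_DATA = {
    "site_a": ["the quick brown fox", "the lazy dog"],
    "site_b": ["the dog barks", "quick quick fox"],
}
NUM_REDUCERS = 2


def partition(word, num_reducers):
    """Choose a reducer for a key using a stable, non-randomized hash."""
    return zlib.crc32(word.encode("utf-8")) % num_reducers


def map_on_site(chunks, num_reducers):
    """Map phase, run where the data lives: count words locally and
    split the partial counts into one bucket per reducer."""
    buckets = [Counter() for _ in range(num_reducers)]
    for chunk in chunks:
        for word in chunk.split():
            buckets[partition(word, num_reducers)][word] += 1
    return buckets


def reduce_partition(partial_counts):
    """Reduce phase: merge the partial counts routed to one reducer."""
    total = Counter()
    for counts in partial_counts:
        total.update(counts)
    return total


def word_count(site_data, num_reducers=NUM_REDUCERS):
    """Run one map task per site, regroup buckets by reducer, then reduce."""
    with ThreadPoolExecutor() as pool:
        # One map task per site; threads stand in for remote workers.
        per_site = list(
            pool.map(lambda chunks: map_on_site(chunks, num_reducers),
                     site_data.values())
        )
        # Stand-in for the intermediate data transfer: regroup by reducer.
        by_reducer = [
            [buckets[r] for buckets in per_site] for r in range(num_reducers)
        ]
        reduced = list(pool.map(reduce_partition, by_reducer))
    result = Counter()
    for part in reduced:
        result.update(part)
    return dict(result)


if __name__ == "__main__":
    print(word_count(SITE_DATA))
```

Each word is routed to exactly one reducer, so the final merge only combines disjoint partitions. In a real distributed run, the regrouping step is where intermediate data would cross site boundaries, which is the cost the abstract says PMR addresses with parallel transfers.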