Optimizing intermediate data management in MapReduce computations

Authors:
Diana Moise;Thi-Thu-Lan Trieu;Luc Bougé;Gabriel Antoniu
Affiliations:
INRIA Rennes - Bretagne Atlantique/IRISA;ENS Cachan, Brittany/IRISA;ENS Cachan, Brittany/IRISA;INRIA Rennes - Bretagne Atlantique/IRISA
Venue:
Proceedings of the First International Workshop on Cloud Computing Platforms
Year:
2011

Citing 6

Dryad: distributed data-parallel programs from sequential building blocks
Proceedings of the 2nd ACM SIGOPS/EuroSys European Conference on Computer Systems 2007

Grid'5000: A Large Scale And Highly Reconfigurable Experimental Grid Testbed
International Journal of High Performance Computing Applications

MapReduce: simplified data processing on large clusters
Communications of the ACM - 50th anniversary issue: 1958 - 2008

Pig latin: a not-so-foreign language for data processing
Proceedings of the 2008 ACM SIGMOD international conference on Management of data

On availability of intermediate data in cloud computations
HotOS'09 Proceedings of the 12th conference on Hot topics in operating systems

BlobSeer: Next-generation data management for large scale infrastructures
Journal of Parallel and Distributed Computing

Cited 2

On modelling and prediction of total CPU usage for applications in mapreduce environments
ICA3PP'12 Proceedings of the 12th international conference on Algorithms and Architectures for Parallel Processing - Volume Part I

Analyzing job completion reliability and job energy consumption for a general MapReduce infrastructure
Journal of High Speed Networks

Abstract

Many cloud computations process large datasets. Programming paradigms have been proposed for designing this type of application, so as to take advantage of the huge processing and storage resources the cloud offers while still providing the user with a clean and easy-to-use interface. Among these programming models, we consider the MapReduce paradigm and its reference implementation, the Hadoop framework. We focus on intermediate data, that is, data produced and transferred between the two stages of the computation (map and reduce). The goal of this paper is to propose a storage mechanism for intermediate data that optimizes the execution of MapReduce applications in the presence of failures while keeping the impact on job completion time to a minimum. To meet this goal, we rely on a fault-tolerant, concurrency-optimized data storage layer based on the BlobSeer data management service. We modify the Hadoop MapReduce framework to store the intermediate data in this layer (acting as a BlobSeer-based distributed file system) rather than on the local storage of the mappers, as in the vanilla version of Hadoop. To validate this work, we perform experiments on a large number of nodes of the Grid'5000 testbed. We demonstrate that our approach not only keeps intermediate data available in case of failures, but also handles read/write accesses efficiently, so that the overall job completion time is substantially improved.
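
The contrast between mapper-local and shared storage of intermediate data can be illustrated with a toy, in-memory sketch. It is not the authors' implementation and uses neither the Hadoop nor the BlobSeer API: the classes `LocalStore` and `SharedStore`, the function `word_count`, and the node names are all hypothetical, and fault tolerance of the shared layer is simply assumed rather than implemented. The sketch only shows which map outputs remain readable by the reducers after a mapper node fails.

```python
import zlib
from collections import defaultdict


class LocalStore:
    """Toy model: map outputs live on the mapper's own node (vanilla-Hadoop style)."""

    def __init__(self):
        self.disks = defaultdict(dict)  # node -> {partition: [(key, value), ...]}

    def write(self, node, partition, pairs):
        self.disks[node].setdefault(partition, []).extend(pairs)

    def fail(self, node):
        # Everything stored on the failed node becomes unreachable.
        self.disks.pop(node, None)

    def read(self, partition):
        return [kv for disk in self.disks.values() for kv in disk.get(partition, [])]


class SharedStore:
    """Hypothetical stand-in for a fault-tolerant shared storage layer."""

    def __init__(self):
        self.partitions = defaultdict(list)

    def write(self, node, partition, pairs):
        # Data is decoupled from the node that produced it.
        self.partitions[partition].extend(pairs)

    def fail(self, node):
        # Assumption: the layer masks the loss of a single node.
        pass

    def read(self, partition):
        return list(self.partitions[partition])


def word_count(store, splits, num_reducers=2, failed_node=None):
    """Run map, optionally fail one mapper node, then reduce from the store."""
    # Map phase: each node emits (word, 1) pairs, partitioned by a stable hash.
    for node, text in splits.items():
        buckets = defaultdict(list)
        for word in text.split():
            partition = zlib.crc32(word.encode("utf-8")) % num_reducers
            buckets[partition].append((word, 1))
        for partition, pairs in buckets.items():
            store.write(node, partition, pairs)

    # Failure injected after the map phase, before reducers fetch their input.
    if failed_node is not None:
        store.fail(failed_node)

    # Reduce phase: sum the counts found in each partition.
    counts = defaultdict(int)
    for partition in range(num_reducers):
        for word, n in store.read(partition):
            counts[word] += n
    return dict(counts)


splits = {"node1": "a b a", "node2": "b c"}
local_result = word_count(LocalStore(), splits, failed_node="node1")
shared_result = word_count(SharedStore(), splits, failed_node="node1")
print(local_result)
print(shared_result)
```

In this sketch the local variant loses the pairs produced on `node1`; in a real Hadoop deployment the affected map tasks would be re-executed to regenerate them, which is the cost a shared, fault-tolerant layer is meant to avoid.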