A case for MapReduce over the internet

  • Authors:
Hrishikesh Gadre; Ivan Rodero; Javier Diaz-Montes; Manish Parashar

  • Affiliations:
Rutgers University, Piscataway, New Jersey (all authors)

  • Venue:
  • Proceedings of the 2013 ACM Cloud and Autonomic Computing Conference
  • Year:
  • 2013


Abstract

In recent years, the MapReduce programming model, and in particular its open-source implementation Hadoop, has been widely used by organizations to perform large-scale data-processing tasks such as web indexing, data mining, and scientific simulations. The key benefits of this programming model are its simple programming interface and its ability to process massive datasets in a scalable fashion without requiring high-end computing infrastructure. We observe that the current design of the Hadoop framework assumes a centralized execution environment within a single datacenter. This assumption leads to simplified design decisions in the Hadoop architecture regarding efficient network usage, specifically in the replica-selection policy of the Hadoop Distributed File System (HDFS) and in the reduce-phase scheduling algorithm. In this paper, we investigate real-world scenarios in which the MapReduce programming model, and the Hadoop framework in particular, could be used to process large-scale, geographically scattered datasets. We show that running Hadoop with its default policies can cause severe performance degradation in such geographically distributed environments. We propose and evaluate extensions to the Hadoop MapReduce framework that improve its performance in these environments, and our evaluation demonstrates that the proposed extensions substantially outperform Hadoop's default policies.
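The replica-selection problem the abstract identifies can be illustrated with a minimal sketch. This is a hypothetical illustration, not the paper's actual extension: single-datacenter HDFS can treat replicas as roughly equidistant, whereas a geo-distributed deployment should prefer a replica in the requester's own site and otherwise fall back to the lowest-latency remote copy. The `Replica` descriptor and its fields are assumptions made for this example.

```java
import java.util.Comparator;
import java.util.List;

// Hypothetical sketch of a site-aware replica selector for a
// geo-distributed HDFS deployment. The Replica type (site name plus a
// measured inter-site latency in milliseconds) is invented for this
// illustration and does not appear in Hadoop's API.
class ReplicaSelector {
    record Replica(String site, double latencyMs) {}

    static Replica select(List<Replica> replicas, String localSite) {
        // Prefer any replica co-located in the requesting site.
        for (Replica r : replicas) {
            if (r.site().equals(localSite)) {
                return r;
            }
        }
        // Otherwise pick the remote replica with the lowest observed
        // latency, instead of treating all remote replicas as equal.
        return replicas.stream()
                .min(Comparator.comparingDouble(Replica::latencyMs))
                .orElseThrow();
    }
}
```

A default, topology-oblivious policy might instead return the first replica in the list, which in a wide-area deployment can route reads across a high-latency inter-site link even when a local copy exists.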