A case for MapReduce over the internet

  • Authors:
Hrishikesh Gadre; Ivan Rodero; Javier Diaz-Montes; Manish Parashar

  • Affiliations:
Rutgers University, Piscataway, New Jersey (all authors)

  • Venue:
  • Proceedings of the 2013 ACM Cloud and Autonomic Computing Conference
  • Year:
  • 2013


Abstract

In recent years, the MapReduce programming model, and in particular its open-source implementation Hadoop, has been widely used by organizations to perform large-scale data-processing tasks such as web indexing, data mining, and scientific simulations. The key benefits of this programming model are its simple programming interface and its ability to process massive datasets in a scalable fashion without requiring high-end computing infrastructure. We observe that the current design of the Hadoop framework assumes a centralized execution environment within a single datacenter. This assumption leads to simplified design decisions in the Hadoop architecture regarding efficient network usage, specifically in the replica-selection policy of the Hadoop Distributed File System (HDFS) and in the reduce-phase scheduling algorithm. In this paper, we investigate real-world scenarios in which the MapReduce programming model, and the Hadoop framework in particular, could be used to process large-scale, geographically scattered datasets. We show that running Hadoop with its default policies can cause severe performance degradation in such geographically distributed environments. We propose and evaluate extensions to the Hadoop MapReduce framework that improve its performance in these environments, and our evaluation demonstrates that the proposed extensions substantially outperform Hadoop's default policies.
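The replica-selection problem the abstract identifies can be illustrated with a minimal sketch. This is a hypothetical illustration, not the paper's actual extension: single-datacenter HDFS can treat replicas as roughly equidistant, whereas a geo-distributed deployment should prefer a replica in the requester's own site and otherwise fall back to the lowest-latency remote copy. The `Replica` descriptor and its fields are assumptions made for this example.

```java
import java.util.Comparator;
import java.util.List;

// Hypothetical sketch of a site-aware replica selector for a
// geo-distributed HDFS deployment. The Replica type (site name plus a
// measured inter-site latency in milliseconds) is invented for this
// illustration and does not appear in Hadoop's API.
class ReplicaSelector {
    record Replica(String site, double latencyMs) {}

    static Replica select(List<Replica> replicas, String localSite) {
        // Prefer any replica co-located in the requesting site.
        for (Replica r : replicas) {
            if (r.site().equals(localSite)) {
                return r;
            }
        }
        // Otherwise pick the remote replica with the lowest observed
        // latency, instead of treating all remote replicas as equal.
        return replicas.stream()
                .min(Comparator.comparingDouble(Replica::latencyMs))
                .orElseThrow();
    }
}
```

A default, topology-oblivious policy might instead return the first replica in the list, which in a wide-area deployment can route reads across a high-latency inter-site link even when a local copy exists.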