In recent years, the MapReduce programming model, and specifically its open-source implementation Hadoop, has been widely used by organizations to perform large-scale data processing tasks such as web indexing, data mining, and scientific simulations. The key benefits of this programming model include its simple programming interface and its ability to process massive datasets in a scalable fashion without requiring high-end computing infrastructure. We observe that the current design of the Hadoop framework assumes a centralized execution environment involving a single datacenter. This assumption leads to simplified design decisions in the Hadoop architecture regarding efficient network usage, specifically in the replica-selection policy of the Hadoop Distributed File System (HDFS) and in the reduce-phase scheduling algorithm. In this paper, we investigate real-world scenarios in which the MapReduce programming model, and specifically the Hadoop framework, could be used to process large-scale, geographically scattered datasets. We show that using the Hadoop framework with its default policies can cause severe performance degradation in such geographically distributed environments. We propose and evaluate extensions to the Hadoop MapReduce framework to improve its performance in such environments. The evaluation demonstrates that the proposed extensions substantially outperform the default policies in the Hadoop framework.