Investigation of Data Locality in MapReduce

  • Authors:
  • Zhenhua Guo, Geoffrey Fox, Mo Zhou

  • Venue:
  • CCGRID '12: Proceedings of the 2012 12th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGrid 2012)
  • Year:
  • 2012


Abstract

Traditional HPC architectures separate compute nodes and storage nodes, which are interconnected with high-speed links to satisfy data access requirements in multi-user environments. However, the capacity of those high-speed links is still much less than the aggregate bandwidth of all compute nodes. In data parallel systems such as GFS/MapReduce, clusters are built with commodity hardware and each node serves both computation and storage roles, which makes it possible to bring computation to the data. Data locality is a significant advantage of data parallel systems over traditional HPC systems: good data locality reduces cross-switch network traffic, one of the bottlenecks in data-intensive computing. In this paper, we investigate data locality in depth. First, we build a mathematical model of scheduling in MapReduce and theoretically analyze how configuration factors, such as the numbers of nodes and tasks, affect data locality. Second, we find that the default Hadoop scheduler is non-optimal and propose an algorithm that schedules multiple tasks simultaneously rather than one by one to achieve optimal data locality. Third, we run extensive experiments to quantify the performance improvement of our proposed algorithm, measure how different factors impact data locality, and investigate how data locality influences job execution time in both single-cluster and cross-cluster environments.
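
The key idea of the proposed scheduler, as described above, is to assign a batch of tasks to free nodes jointly rather than one at a time. The sketch below is a minimal, hypothetical illustration of that idea in Python, not the paper's implementation: the task and node names, the 0/1 cost model (node-local versus remote), and the brute-force search standing in for a proper assignment-problem solver are assumptions made only for illustration.

```python
# Hypothetical sketch: greedy one-by-one scheduling vs. joint scheduling of all tasks.
# Cost model (0 = node-local, 1 = remote) and brute-force enumeration are illustrative
# assumptions, not the paper's actual algorithm or data.
from itertools import permutations

def placement_cost(task_replicas, node):
    """Cost 0 if the node holds a replica of the task's input block, else 1."""
    return 0 if node in task_replicas else 1

def greedy_schedule(tasks, free_nodes):
    """Assign tasks one by one: each task takes the best node still free
    when it is considered, ignoring tasks scheduled later."""
    assignment, remaining = {}, list(free_nodes)
    for task, replicas in tasks.items():
        best = min(remaining, key=lambda n: placement_cost(replicas, n))
        assignment[task] = best
        remaining.remove(best)
    return assignment

def joint_schedule(tasks, free_nodes):
    """Assign all tasks simultaneously by searching for the node assignment
    with minimum total cost (brute force stands in for an assignment solver)."""
    task_ids = list(tasks)
    best_assign, best_cost = None, float("inf")
    for perm in permutations(free_nodes, len(task_ids)):
        cost = sum(placement_cost(tasks[t], n) for t, n in zip(task_ids, perm))
        if cost < best_cost:
            best_assign, best_cost = dict(zip(task_ids, perm)), cost
    return best_assign

# Task t1's input block is replicated on n1 and n2; t2's only on n1.
tasks = {"t1": {"n1", "n2"}, "t2": {"n1"}}
print(greedy_schedule(tasks, ["n1", "n2"]))  # t1 -> n1, t2 -> n2 (one remote task)
print(joint_schedule(tasks, ["n1", "n2"]))   # t1 -> n2, t2 -> n1 (all node-local)
```

In this toy example, the greedy pass gives t1 the first local node it finds and thereby forces t2 onto a remote node, whereas the joint assignment places both tasks on nodes holding their input data, illustrating why scheduling tasks together can yield better data locality than scheduling them one by one.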