Harnessing input redundancy in a MapReduce framework

Authors:
Shin-gyu Kim;Hyuck Han;Hyungsoo Jung;Hyeonsang Eom;Heon Y. Yeom
Affiliations:
Seoul National University, Seoul, Korea;Seoul National University, Seoul, Korea;Seoul National University, Seoul, Korea;Seoul National University, Seoul, Korea;Seoul National University, Seoul, Korea
Venue:
Proceedings of the 2010 ACM Symposium on Applied Computing
Year:
2010

Citing 9

Scale and performance in a distributed file system
ACM Transactions on Computer Systems (TOCS)

The Google file system
SOSP '03 Proceedings of the nineteenth ACM symposium on Operating systems principles

Interpreting the data: Parallel analysis with Sawzall
Scientific Programming - Dynamic Grids and Worldwide Computing

Map-reduce-merge: simplified relational data processing on large clusters
Proceedings of the 2007 ACM SIGMOD international conference on Management of data

MapReduce: simplified data processing on large clusters
OSDI'04 Proceedings of the 6th conference on Symposium on Operating Systems Design & Implementation - Volume 6

SCOPE: easy and efficient parallel processing of massive data sets
Proceedings of the VLDB Endowment

DryadInc: reusing work in large-scale computations
HotCloud'09 Proceedings of the 2009 conference on Hot topics in cloud computing

Wave computing in the cloud
HotOS'09 Proceedings of the 12th conference on Hot topics in operating systems

DryadLINQ: a system for general-purpose distributed data-parallel computing using a high-level language
OSDI'08 Proceedings of the 8th USENIX conference on Operating systems design and implementation

Cited 1

Hierarchical merge for scalable MapReduce
Proceedings of the 2012 workshop on Management of big data systems


Abstract

The proliferation of data-parallel programming on large clusters has opened a new research avenue: accommodating numerous types of data-intensive applications with a feasible plan. Across these research efforts, we observe a nontrivial amount of redundant I/O in the execution of data-intensive applications. Even the locality-aware scheduling policy of a MapReduce framework is ineffective in a cluster environment where storage nodes cannot provide a computation service. In this paper, we introduce SplitCache, which improves the performance of data-intensive OLAP-style applications by reducing redundant I/O in a MapReduce framework. The key strategy is to cut down the redundant I/O incurred when multiple applications read common input data. SplitCache caches the first input stream in the computing nodes and reuses it for future demand. In executing the TPC-H benchmark, we achieved 65.5% faster execution and an 87% reduction in network traffic on average.
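
The caching idea described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the class name `SplitCacheSketch`, the `(path, offset, length)` cache key, the `fetch_remote` callback, and the unbounded in-memory dictionary are all assumptions made for illustration. The abstract does not specify how splits are identified, stored, or evicted.

```python
from typing import Callable, Dict, Tuple

# Assumed cache key: an input split identified by file path, byte offset, and length.
SplitKey = Tuple[str, int, int]


class SplitCacheSketch:
    """Hypothetical per-compute-node cache of input splits (illustration only)."""

    def __init__(self, fetch_remote: Callable[[str, int, int], bytes]) -> None:
        # fetch_remote stands in for a read from a remote storage node.
        self._fetch_remote = fetch_remote
        self._cache: Dict[SplitKey, bytes] = {}
        self.remote_bytes = 0  # bytes actually pulled over the network

    def read_split(self, path: str, offset: int, length: int) -> bytes:
        key = (path, offset, length)
        data = self._cache.get(key)
        if data is None:
            # First request for this split: read it remotely and keep a local copy.
            data = self._fetch_remote(path, offset, length)
            self.remote_bytes += len(data)
            self._cache[key] = data
        # Later requests for the same split are served from the local copy.
        return data
```

In this sketch, two jobs that scan the same split of a shared table trigger only one remote read; the second job is served locally, which is the kind of redundant I/O the abstract targets. A realistic design would also need eviction and invalidation policies, which are omitted here.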