Harnessing input redundancy in a MapReduce framework

  • Authors:
  • Shin-gyu Kim;Hyuck Han;Hyungsoo Jung;Hyeonsang Eom;Heon Y. Yeom

  • Affiliations:
  • Seoul National University, Seoul, Korea;Seoul National University, Seoul, Korea;Seoul National University, Seoul, Korea;Seoul National University, Seoul, Korea;Seoul National University, Seoul, Korea

  • Venue:
  • Proceedings of the 2010 ACM Symposium on Applied Computing
  • Year:
  • 2010

Abstract

The proliferation of data-parallel programming on large clusters has opened a new research avenue: accommodating numerous types of data-intensive applications with a feasible plan. Across the many research efforts in this area, we observe that a nontrivial amount of redundant I/O occurs in the execution of data-intensive applications. Even the locality-aware scheduling policy of a MapReduce framework is ineffective in a cluster environment where storage nodes cannot provide a computation service. In this paper, we introduce SplitCache, which improves the performance of data-intensive OLAP-style applications by reducing redundant I/O in a MapReduce framework. The key strategy is to cut down the redundant I/O incurred when multiple applications read common input data. SplitCache caches the first input stream on the computing nodes and reuses it for future requests. In executing the TPC-H benchmark, we achieved 65.5% faster execution and an 87% reduction in network traffic on average.
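The caching idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the class name `SplitCacheSketch`, the `(path, offset, length)` split key, and the `fake_storage_read` function are all hypothetical, and the sketch ignores eviction, cache capacity, and scheduling. It only shows how a compute node might serve a repeated read of the same input split from a local cache instead of fetching it again from a remote storage node.

```python
from typing import Callable, Dict, Tuple

# Hypothetical split identifier: (file path, byte offset, length in bytes).
SplitKey = Tuple[str, int, int]


class SplitCacheSketch:
    """Illustrative per-compute-node cache of input splits (assumed design)."""

    def __init__(self, fetch_remote: Callable[[SplitKey], bytes]) -> None:
        self._fetch_remote = fetch_remote      # reads a split from a storage node
        self._cache: Dict[SplitKey, bytes] = {}
        self.remote_reads = 0                  # reads that crossed the network
        self.cache_hits = 0                    # reads served from the local cache

    def read_split(self, key: SplitKey) -> bytes:
        """Return the split's data, fetching remotely only on the first request."""
        if key in self._cache:
            self.cache_hits += 1
            return self._cache[key]
        data = self._fetch_remote(key)
        self.remote_reads += 1
        self._cache[key] = data                # keep it for future requests
        return data


def fake_storage_read(key: SplitKey) -> bytes:
    """Stand-in for a remote storage node: returns `length` zero bytes."""
    _path, _offset, length = key
    return bytes(length)


if __name__ == "__main__":
    cache = SplitCacheSketch(fake_storage_read)
    splits = [("lineitem.tbl", i * 64, 64) for i in range(3)]
    # Two applications (for example, two queries) scan the same input.
    for _query in range(2):
        for split in splits:
            cache.read_split(split)
    print("remote reads:", cache.remote_reads, "cache hits:", cache.cache_hits)
```

In this toy run the second scan of the same splits triggers no remote reads, which is the kind of redundant I/O the abstract says SplitCache is meant to remove.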