Exploring MapReduce efficiency with highly-distributed data

  • Authors:
  • Michael Cardosa;Chenyu Wang;Anshuman Nangia;Abhishek Chandra;Jon Weissman

  • Affiliations:
  • University of Minnesota, Minneapolis, MN, USA;University of Minnesota, Minneapolis, MN, USA;University of Minnesota, Minneapolis, MN, USA;University of Minnesota, Minneapolis, MN, USA;University of Minnesota, Minneapolis, MN, USA

  • Venue:
  • Proceedings of the Second International Workshop on MapReduce and Its Applications
  • Year:
  • 2011

Abstract

MapReduce is a highly popular paradigm for high-performance computing over large data sets on large-scale platforms. However, when the source data is widely distributed and the computing platform is also distributed (e.g., data is collected in separate data center locations), the most efficient architecture for running Hadoop jobs over the entire data set is non-trivial to determine. In this paper, we show that the traditional single-cluster MapReduce setup may not be suitable when data and compute resources are widely distributed. Further, we provide recommendations for alternative (and even hierarchical) distributed MapReduce configurations, depending on the workload and data set.
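
To make the contrast between a single-cluster setup and a hierarchical one concrete, the following is a minimal sketch in plain Python, not the authors' implementation and not Hadoop code. It assumes a word-count workload, and the site names (`dc_east`, `dc_west`) and function names are hypothetical, chosen only for illustration. The centralized variant gathers every raw record in one place before running the job, while the hierarchical variant runs a local job per site and combines only the partial results.

```python
from collections import Counter
from functools import reduce


def map_reduce_word_count(records):
    """Toy single-site job: map each record to word counts, then reduce by summing."""
    mapped = (Counter(line.split()) for line in records)
    return reduce(lambda a, b: a + b, mapped, Counter())


def centralized_word_count(sites):
    """Single-cluster style: pull all raw records together, then run one job."""
    all_records = [line for records in sites.values() for line in records]
    return map_reduce_word_count(all_records)


def hierarchical_word_count(sites):
    """Hierarchical style: one local job per site, then a global reduce over partials."""
    partials = [map_reduce_word_count(records) for records in sites.values()]
    return reduce(lambda a, b: a + b, partials, Counter())


# Hypothetical input: two data centers, each holding its own records.
sites = {
    "dc_east": ["a b a", "c a"],
    "dc_west": ["b b", "a c c"],
}

centralized = centralized_word_count(sites)
hierarchical = hierarchical_word_count(sites)
```

Both variants produce the same counts here because word counting is associative and commutative. What differs is the data that would cross site boundaries: raw records in the centralized case, aggregated partial counts in the hierarchical case. Which is cheaper depends on the workload and data set, as the abstract notes.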