A comparison of join algorithms for log processing in MaPreduce

  • Authors:
  • Spyros Blanas;Jignesh M. Patel;Vuk Ercegovac;Jun Rao;Eugene J. Shekita;Yuanyuan Tian

  • Affiliations:
  • University of Wisconsin-Madison, Madison, WI, USA;University of Wisconsin-Madison, Madison, WI, USA;IBM Almaden Research Center, San Jose, CA, USA;IBM Almaden Research Center, San Jose, CA, USA;IBM Almaden Research Center, San Jose, CA, USA;IBM Almaden Research Center, San Jose, CA, USA

  • Venue:
  • Proceedings of the 2010 ACM SIGMOD International Conference on Management of data
  • Year:
  • 2010

Quantified Score

Hi-index 0.00

Visualization

Abstract

The MapReduce framework is increasingly being used to analyze large volumes of data. One important type of data analysis done with MapReduce is log processing, in which a click-stream or an event log is filtered, aggregated, or mined for patterns. As part of this analysis, the log often needs to be joined with reference data such as information about users. Although there have been many studies examining join algorithms in parallel and distributed DBMSs, the MapReduce framework is cumbersome for joins. MapReduce programmers often use simple but inefficient algorithms to perform joins. In this paper, we describe crucial implementation details of a number of well-known join strategies in MapReduce, and present a comprehensive experimental comparison of these join techniques on a 100-node Hadoop cluster. Our results provide insights that are unique to the MapReduce platform and offer guidance on when to use a particular join algorithm on this platform.