Scatter-Gather-Merge: An efficient star-join query processing algorithm for data-parallel frameworks

  • Authors: Hyuck Han; Hyungsoo Jung; Hyeonsang Eom; Heon Y. Yeom

  • Affiliations: School of Computer Science and Engineering, Seoul National University, Seoul, Korea 151-742; School of Information Technologies, University of Sydney, Sydney, Australia 2006; School of Computer Science and Engineering, Seoul National University, Seoul, Korea 151-742; School of Computer Science and Engineering, Seoul National University, Seoul, Korea 151-742

  • Venue: Cluster Computing
  • Year: 2011

Abstract

A data-parallel framework is attractive for large-scale data processing because it lets applications process huge amounts of data on commodity machines. MapReduce, a popular data-parallel framework, is used in fields such as web search, data mining, and data warehousing, and it has proven very practical for such data-parallel applications. Star-join queries are common in data warehouses, which are a current target domain of data-parallel frameworks. This article proposes a new algorithm that efficiently processes star-join queries in data-parallel frameworks such as MapReduce and Dryad. Our star-join algorithm for general data-parallel frameworks, called Scatter-Gather-Merge, processes star-join queries in a constant number of computation steps even as the number of participating dimension tables increases. By adopting Bloom filters, Scatter-Gather-Merge eliminates a non-trivial amount of I/O. We also show that Scatter-Gather-Merge can be easily applied to MapReduce. Our experimental results in both cluster and cloud environments show that Scatter-Gather-Merge outperforms existing approaches.
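
The abstract does not describe the internals of Scatter-Gather-Merge, so the following is only a minimal single-process sketch of the general idea it mentions: using Bloom filters built from dimension-table keys to discard fact-table rows that cannot join, before any exact matching is done. It is not the authors' implementation. The `BloomFilter` class, the `star_join` function, the filter sizes, and the sample tables are all hypothetical names and values chosen for illustration.

```python
import hashlib


class BloomFilter:
    """A small Bloom filter: no false negatives, occasional false positives."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits)  # one byte per bit, for simplicity

    def _positions(self, key):
        # Derive num_hashes positions by salting the key with a seed.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos] = 1

    def might_contain(self, key):
        return all(self.bits[pos] for pos in self._positions(key))


def star_join(fact_rows, dimensions):
    """Join fact rows against several dimension tables.

    dimensions maps a foreign-key column name in the fact table to a
    dict {key: attributes} holding that dimension's qualifying rows.
    (Assumed layout, for illustration only.)
    """
    # Build one Bloom filter per dimension from its qualifying keys.
    filters = {}
    for fk, table in dimensions.items():
        bloom = BloomFilter()
        for key in table:
            bloom.add(key)
        filters[fk] = bloom

    results = []
    for row in fact_rows:
        # Cheap pre-filter: drop rows that definitely match no dimension row.
        if not all(filters[fk].might_contain(row[fk]) for fk in dimensions):
            continue
        # Exact check removes any Bloom-filter false positives.
        if all(row[fk] in table for fk, table in dimensions.items()):
            joined = dict(row)
            for fk, table in dimensions.items():
                joined.update(table[row[fk]])
            results.append(joined)
    return results


# Hypothetical sample data: one fact table and two dimension tables.
date_dim = {1: {"year": 2010}, 2: {"year": 2011}}
product_dim = {10: {"category": "book"}}
facts = [
    {"date_id": 1, "product_id": 10, "amount": 5},
    {"date_id": 2, "product_id": 11, "amount": 7},
    {"date_id": 3, "product_id": 10, "amount": 9},
]
result = star_join(facts, {"date_id": date_dim, "product_id": product_dim})
```

In this sketch only the first fact row survives, since it is the only one whose keys appear in both dimension tables. In a distributed setting the benefit of such a pre-filter would come from not shipping or writing the discarded rows, which is presumably where the I/O saving described in the abstract arises.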