Memory footprint matters: efficient equi-join algorithms for main memory data processing

  • Authors:
  • Spyros Blanas; Jignesh M. Patel

  • Affiliations:
  • University of Wisconsin-Madison; University of Wisconsin-Madison

  • Venue:
  • Proceedings of the 4th annual Symposium on Cloud Computing
  • Year:
  • 2013

Abstract

High-performance analytical data processing systems often run on servers with large amounts of main memory. A common operation in such environments is combining data from two or more sources using a join algorithm. This paper studies hash-based and sort-based equi-join algorithms when the data sets being joined reside entirely in main memory. We consider only a single-node setting, which is an important building block for larger high-performance distributed data processing systems. A critical contribution of this work is pointing out that, in addition to query response time, one must also consider the memory footprint of each join algorithm, because it affects the number of concurrent queries that can be serviced. Memory footprint becomes an important deployment consideration when analytical data processing services run on hardware shared with other concurrent services. We also consider the impact of particular physical properties of the input and output of each join algorithm, information that is essential for optimizing complex query pipelines with multiple joins. Our key contribution is characterizing the properties of hash-based and sort-based equi-join algorithms, thereby allowing system implementers and query optimizers to make a more informed choice about which join algorithm to use.
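
For readers unfamiliar with the two algorithm families named in the abstract, the following is a minimal single-threaded sketch in Python. It is not the authors' implementation, which the abstract does not describe. The function names `hash_join` and `sort_merge_join` and the `(key, payload)` tuple layout are assumptions made for illustration. The sketch hints at the trade-offs the abstract raises: the hash join keeps an extra hash table over one input in memory, while the sort-merge join works on sorted copies of both inputs and emits output ordered by the join key.

```python
from collections import defaultdict


def hash_join(build, probe):
    """Equi-join two lists of (key, payload) tuples using a hash table.

    Illustrative only: the hash table over `build` is the extra memory
    this variant needs beyond the inputs and the output.
    """
    table = defaultdict(list)
    for key, payload in build:
        table[key].append(payload)

    out = []
    for key, payload in probe:
        # .get avoids inserting empty entries for keys with no match
        for match in table.get(key, ()):
            out.append((key, match, payload))
    return out  # follows the order of `probe`, not key order


def sort_merge_join(left, right):
    """Equi-join two lists of (key, payload) tuples by sorting and merging.

    Illustrative only: this version sorts copies of both inputs; the
    output comes out ordered by the join key.
    """
    left = sorted(left, key=lambda row: row[0])
    right = sorted(right, key=lambda row: row[0])

    out = []
    i = j = 0
    while i < len(left) and j < len(right):
        lk, rk = left[i][0], right[j][0]
        if lk < rk:
            i += 1
        elif lk > rk:
            j += 1
        else:
            # Find the run of rows sharing this key on each side.
            i_end = i
            while i_end < len(left) and left[i_end][0] == lk:
                i_end += 1
            j_end = j
            while j_end < len(right) and right[j_end][0] == lk:
                j_end += 1
            # Emit the cross product of the two runs.
            for l_row in left[i:i_end]:
                for r_row in right[j:j_end]:
                    out.append((lk, l_row[1], r_row[1]))
            i, j = i_end, j_end
    return out


if __name__ == "__main__":
    R = [(3, "c"), (1, "a"), (2, "b"), (1, "a2")]
    S = [(2, "x"), (1, "y"), (4, "z")]
    print(hash_join(R, S))
    print(sort_merge_join(R, S))
```

Both functions return the same set of matching rows; only the order differs. An output already sorted on the join key is an example of a physical property that a later operator in a multi-join pipeline might exploit.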