A Parallel Hash Join Algorithm for Managing Data Skew

  • Authors:
  • Joel L. Wolf;Philip S. Yu;John Turek;Daniel M. Dias

  • Affiliations:
  • -;-;-;-

  • Venue:
  • IEEE Transactions on Parallel and Distributed Systems
  • Year:
  • 1993

Quantified Score

Hi-index 0.00

Visualization

Abstract

Presents a parallel hash join algorithm that is based on the concept of hierarchicalhashing, to address the problem of data skew. The proposed algorithm splits the usualhash phase into a hash phase and an explicit transfer phase, and adds an extrascheduling phase between these two. During the scheduling phase, a heuristicoptimization algorithm, using the output of the hash phase, attempts to balance the loadacross the multiple processors in the subsequent join phase. The algorithm naturallyidentifies the hash partitions with the largest skew values and splits them as necessary,assigning each of them to an optimal number of processors. Assuming for concreteness aZipf-like distribution of the values in the join column, a join phase which is CPU-bound,and a shared nothing environment, the algorithm is shown to achieve good join phaseload balancing, and to be robust relative to the degree of data skew and the totalnumber of processors. The overall speedup due to this algorithm is compared to someexisting parallel hash join methods. The proposed method does considerably better in high skew situations.