Using intrinsic data skew to improve hash join performance

Authors:
Bryce Cutt; Ramon Lawrence
Affiliations:
Department of Computer Science, University of British Columbia, Okanagan, 3333 University Way, Kelowna, British Columbia, Canada V1V 1V7 (both authors)
Venue:
Information Systems
Year:
2009

Abstract

Hash join is used to join large, unordered relations, and it operates independently of the data distributions of the join relations. Real-world data sets, however, are not uniformly distributed and often contain significant skew. Although partition skew has been studied for hash joins, no prior work has examined how data skew can be exploited to improve hash join performance. In this paper, we present histojoin, a join algorithm that uses histograms to identify data skew and improve join performance. Experimental results show that, for skewed data sets, histojoin performs significantly fewer I/O operations and is 10-60% faster than hybrid hash join.
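The abstract does not say how the histograms are used, so the following is only a minimal sketch of one plausible strategy, not the histojoin algorithm itself. It assumes that a frequency histogram over the probe relation's join keys is available, and that the build tuples for the most frequent keys are kept memory resident so that matching probe tuples never need to be written to disk. All function names are hypothetical. I/O is approximated as the number of spilled tuples rather than pages, and the histogram is computed directly from the probe data purely for illustration.

```python
from collections import Counter


def resident_keys_naive(build, capacity):
    """Baseline (assumption): keep the first `capacity` distinct build keys,
    ignoring how often each key occurs in the probe relation."""
    return list(dict.fromkeys(key for key, _ in build))[:capacity]


def resident_keys_from_histogram(build, probe_histogram, capacity):
    """Skew-aware choice (assumption): keep the build keys that the
    histogram says occur most often in the probe relation."""
    build_keys = {key for key, _ in build}
    ranked = [k for k, _ in probe_histogram.most_common() if k in build_keys]
    return ranked[:capacity]


def join_with_resident_keys(build, probe, resident_keys):
    """Two-phase hash join over (key, payload) tuples.

    Tuples whose key is in `resident_keys` are joined in memory; all other
    tuples are "spilled" and joined in a second pass.
    Returns (result_rows, spilled_tuple_count)."""
    resident = set(resident_keys)
    in_memory, spilled_build = {}, []
    for key, payload in build:
        if key in resident:
            in_memory.setdefault(key, []).append(payload)
        else:
            spilled_build.append((key, payload))

    results, spilled_probe = [], []
    for key, payload in probe:
        if key in resident:
            # Joined immediately, with no disk write for this probe tuple.
            results.extend((key, b, payload) for b in in_memory.get(key, []))
        else:
            spilled_probe.append((key, payload))

    # Second pass: join the spilled tuples (simulated in memory here).
    table = {}
    for key, payload in spilled_build:
        table.setdefault(key, []).append(payload)
    for key, payload in spilled_probe:
        results.extend((key, b, payload) for b in table.get(key, []))

    return results, len(spilled_build) + len(spilled_probe)


# Illustrative data: key 7 dominates the probe relation.
build = [(k, f"r{k}") for k in range(10)]
probe = [(7, f"s{i}") for i in range(50)] + [(k, f"t{k}") for k in range(10)]
histogram = Counter(key for key, _ in probe)

base_rows, base_io = join_with_resident_keys(
    build, probe, resident_keys_naive(build, 2))
skew_rows, skew_io = join_with_resident_keys(
    build, probe, resident_keys_from_histogram(build, histogram, 2))
print("spilled tuples, naive vs. histogram-guided:", base_io, skew_io)
```

Both variants produce the same join result; they differ only in how many tuples are spilled. With uniformly distributed probe keys the two choices would spill about the same amount, which is consistent with the abstract's claim that the benefit appears on skewed data sets.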