On optimizing relational self-joins

Authors:
Yu Cao;Yongluan Zhou;Chee-Yong Chan;Kian-Lee Tan
Affiliations:
National University of Singapore, Singapore, and EMC Labs, China;University of Southern Denmark, Denmark;National University of Singapore, Singapore;National University of Singapore, Singapore
Venue:
Proceedings of the 15th International Conference on Extending Database Technology
Year:
2012

Abstract

Self-join, which joins a relation with itself, is a prevalent operation in relational database systems. Despite its wide applicability, little attention has been devoted to improving its performance. In this paper, we present SCALE (Sort for Clustered Access with Lazy Evaluation), an efficient self-join algorithm that takes advantage of the fact that both inputs of a self-join operation are instances of the same relation. SCALE first sorts the relation on one join attribute, say R.A. In this way, for every value of the other join attribute, say R.B, its matching R.A tuples are essentially clustered. As SCALE scans the sorted relation, each tuple is joined with its matching tuples co-existing in memory. Tuples for which full-range clustered access to their matching tuples is not possible are buffered, and the unfinished part of their join processing is deferred. Such lazy evaluation minimizes the need for "random" access to the matching tuples. SCALE further optimizes the memory allocation for clustered access and lazy evaluation to keep the processing cost minimal. Our analytical study shows that SCALE degenerates gracefully to a Sort-Merge Join in the worst case. We have also implemented SCALE in PostgreSQL, and the results of our extensive experimental study show that it outperforms both Sort-Merge Join and Hybrid Hash Join by a wide margin in (almost) all cases.
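
The idea in the abstract can be illustrated with a small in-memory sketch. It is a toy under stated assumptions, not the authors' algorithm or their PostgreSQL implementation. The function name `scale_like_self_join` and the parameter `chunk_size` (standing in for the memory budget) are hypothetical. The join predicate is assumed to be r1.B = r2.A. The binary search over the full key array is a simplification that stands in for knowing where a run of matching tuples lies in the sorted relation.

```python
from bisect import bisect_left, bisect_right


def scale_like_self_join(rows, chunk_size):
    """Toy self-join on r1.B == r2.A over rows given as (a, b) pairs.

    Assumption: chunk_size models how many sorted tuples fit in memory.
    """
    # Sort on the join attribute A so that, for any B value, the
    # matching A tuples form one contiguous run.
    srt = sorted(rows, key=lambda r: r[0])
    keys = [r[0] for r in srt]

    out = []       # joined pairs (r1, r2)
    deferred = []  # (r1, lo, hi): part of r1's run not in memory yet

    for start in range(0, len(srt), chunk_size):
        end = min(start + chunk_size, len(srt))
        for r1 in srt[start:end]:
            # Run of tuples whose A equals r1.B, as index range [lo, hi).
            lo = bisect_left(keys, r1[1])
            hi = bisect_right(keys, r1[1])

            # Join immediately with the part of the run in this chunk.
            for i in range(max(lo, start), min(hi, end)):
                out.append((r1, srt[i]))

            # Defer the parts of the run that lie outside the chunk.
            if lo < min(hi, start):
                deferred.append((r1, lo, min(hi, start)))
            if max(lo, end) < hi:
                deferred.append((r1, max(lo, end), hi))

    # Lazy phase: visit deferred work in run order so that accesses
    # to the sorted relation stay clustered rather than random.
    deferred.sort(key=lambda d: d[1])
    for r1, lo, hi in deferred:
        for i in range(lo, hi):
            out.append((r1, srt[i]))

    return out


rows = [(1, 2), (2, 3), (3, 1), (2, 2), (4, 4)]
pairs = scale_like_self_join(rows, chunk_size=2)
```

In this sketch, a very small `chunk_size` pushes most of the work into the deferred phase, which then behaves much like a merge over sorted inputs. A `chunk_size` covering the whole relation defers nothing.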