Progressive merge join: a generic and non-blocking sort-based join algorithm

Authors:
Jens-Peter Dittrich; Bernhard Seeger; David Scot Taylor; Peter Widmayer
Affiliations:
Department of Mathematics and CS, University of Marburg, Marburg, Germany; Department of Mathematics and CS, University of Marburg, Marburg, Germany; Department of CS, ETHZ, Zürich, Switzerland; Department of CS, ETHZ, Zürich, Switzerland
Venue:
VLDB '02 Proceedings of the 28th international conference on Very Large Data Bases
Year:
2002

Abstract

Many state-of-the-art join techniques require the input relations to be almost fully sorted before the actual join processing starts. Thus, these techniques start producing first results only after a considerable time period has passed. This blocking behaviour is a serious problem when subsequent operators have to stop processing in order to wait for the first results of the join. Furthermore, this behaviour is not acceptable if the result of the join is visualized and/or requires user interaction. These are typical scenarios for data mining applications. The 'off-time' of existing techniques even increases with growing problem sizes. In this paper, we propose a generic technique called Progressive Merge Join (PMJ) that eliminates the blocking behaviour of sort-based join algorithms. The basic idea behind PMJ is to have the join produce results as soon as the external mergesort generates initial runs. Hence, it is possible for PMJ to return first results very early. This paper provides the basic algorithms and the generic framework of PMJ, as well as use cases for different types of joins. Moreover, we provide a generic online selectivity estimator with probabilistic quality guarantees. For similarity joins in particular, the first non-blocking join algorithms are derived by applying PMJ to the state-of-the-art techniques. We have implemented PMJ as part of an object-relational cursor algebra. A set of experiments shows that a substantial number of results are produced even before the input relations would have been sorted. We observed only a moderate increase in the total runtime compared to the blocking counterparts.
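
The basic idea stated in the abstract, producing join results while the initial sorted runs are being generated, can be illustrated with a minimal Python sketch for an equi-join. This is not the authors' implementation: the function names (`chunks`, `merge_join`, `progressive_merge_join`), the `memory` parameter, the `(key, payload)` tuple layout, and the use of in-memory lists in place of disk-resident runs are all assumptions made for illustration. The second phase is also simplified to pairwise joins of runs created at different times, rather than a true multi-way merge.

```python
from itertools import islice


def chunks(iterable, size):
    """Yield successive lists of at most `size` items from `iterable`."""
    it = iter(iterable)
    while True:
        block = list(islice(it, size))
        if not block:
            return
        yield block


def merge_join(left, right):
    """Equi-join two lists of (key, payload) tuples that are sorted by key."""
    i = j = 0
    while i < len(left) and j < len(right):
        lk, rk = left[i][0], right[j][0]
        if lk < rk:
            i += 1
        elif lk > rk:
            j += 1
        else:
            # Find the end of the group of equal keys on both sides.
            i_end = i
            while i_end < len(left) and left[i_end][0] == lk:
                i_end += 1
            j_end = j
            while j_end < len(right) and right[j_end][0] == lk:
                j_end += 1
            for a in left[i:i_end]:
                for b in right[j:j_end]:
                    yield (a, b)
            i, j = i_end, j_end


def progressive_merge_join(r, s, memory=4):
    """Yield matching (r_item, s_item) pairs, starting during run generation.

    `memory` is a hypothetical budget in tuples, split evenly between inputs.
    """
    half = max(1, memory // 2)
    r_runs, s_runs = [], []
    r_chunks, s_chunks = chunks(r, half), chunks(s, half)

    # Phase 1: load one chunk of each input, sort both, join them right away,
    # then keep them as sorted runs (a real system would write them to disk).
    while True:
        rb = next(r_chunks, None)
        sb = next(s_chunks, None)
        if rb is None and sb is None:
            break
        rb = sorted(rb or [], key=lambda t: t[0])
        sb = sorted(sb or [], key=lambda t: t[0])
        yield from merge_join(rb, sb)  # early results
        r_runs.append(rb)
        s_runs.append(sb)

    # Phase 2 (simplified): join every R run with every S run created at a
    # different time. Pairs with i == j were already reported in phase 1,
    # so skipping them avoids duplicate results.
    for i, rb in enumerate(r_runs):
        for j, sb in enumerate(s_runs):
            if i != j:
                yield from merge_join(rb, sb)


if __name__ == "__main__":
    r = [(k % 5, f"r{k}") for k in range(12)]
    s = [(k % 3, f"s{k}") for k in range(9)]
    for pair in progressive_merge_join(r, s, memory=4):
        print(pair)
```

Because the function is a generator, a consumer receives the first matches after only one chunk of each input has been read. The price is that each result must be attributed to exactly one phase, which the sketch handles by skipping run pairs that were already joined in phase 1.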