Progressive merge join: a generic and non-blocking sort-based join algorithm

  • Authors:
  • Jens-Peter Dittrich;Bernhard Seeger;David Scot Taylor;Peter Widmayer

  • Affiliations:
  • Department of Mathematics and CS, University of Marburg, Marburg, Germany;Department of Mathematics and CS, University of Marburg, Marburg, Germany;Department of CS, ETHZ, Zürich, Switzerland;Department of CS, ETHZ, Zürich, Switzerland

  • Venue:
  • VLDB '02 Proceedings of the 28th international conference on Very Large Data Bases
  • Year:
  • 2002

Quantified Score

Hi-index 0.00

Visualization

Abstract

Many state-of-the-art join-techniques require the input relations to be almost fully sorted before the actual join processing starts. Thus, these techniques start producing first results only after a considerable time period has passed. This blocking behaviour is a serious problem when consequent operators have to stop processing, in order to wait for first results of the join. Furthermore, this behaviour is not acceptable if the result of the join is visualized or/ and requires user interaction. These are typical scenarios for data mining applications. The, off-time' of existing techniques even increases with growing problem sizes. In this paper, we propose a generic technique called Progressive Merge Join (PMJ) that eliminates the blocking behaviour of sort-based join algorithms. The basic idea behind PMJ is to have the join produce results, as early as the external mergesort generates initial runs. Hence, it is possible for PMJ to return first results very early. This paper provides the basic algorithms and the generic framework of PMJ, as well as use-cases for different types of joins. Moreover, we provide a generic online selectivity estimator with probabilistic quality guarantees. For similarity joins in particular, first non-blocking join algorithms are derived from applying PMJ to the state-of-the-art techniques. We have implemented PMJ as part of an object-relational cursor algebra. A set of experiments shows that a substantial amount of results are produced, even before the input relationas would have been sorted. We observed only a moderate increase in the total runtime compared to the blocking counterparts.