PG-join: proximity graph based string similarity joins

  • Authors:
  • Michail Kazimianec; Nikolaus Augsten

  • Affiliations:
  • Faculty of Computer Science, Free University of Bozen-Bolzano, Bozen, Italy; Faculty of Computer Science, Free University of Bozen-Bolzano, Bozen, Italy

  • Venue:
  • SSDBM'11: Proceedings of the 23rd International Conference on Scientific and Statistical Database Management
  • Year:
  • 2011

Abstract

In many applications, for example in data integration scenarios, strings must be matched if they are similar. String similarity joins, which match all pairs of similar strings from two datasets, are of particular interest and have recently received much attention in the database research community. Most approaches, however, assume a global similarity threshold: all string pairs that exceed the threshold form a match in the join result. The global threshold approach has two major problems: (a) the threshold depends on the (mostly unknown) data distribution, and (b) often there is no single threshold that is good for all string pairs. In this paper we propose the PG-Join algorithm, a novel string similarity join that requires no configuration and uses an adaptive threshold. PG-Join computes a so-called proximity graph to derive an individual threshold for each string. Computing the proximity graph efficiently is essential for the scalability of PG-Join. To this end we develop a new and fast algorithm, PG-I, that computes the proximity graph in two steps: first, an efficient approximation is computed; then, the approximation error is fixed incrementally until the adaptive threshold is stable. Our extensive experiments on real-world and synthetic data show that PG-I is up to five times faster than the state-of-the-art algorithm and suggest that PG-Join is a useful and effective join paradigm.
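
To make the global-threshold baseline that the abstract criticizes concrete, the following is a minimal sketch of a naive string similarity join with a single threshold. It is not PG-Join or PG-I; the similarity measure (Jaccard over padded q-grams), the padding character, and the function names `qgrams`, `jaccard`, and `global_threshold_join` are assumptions chosen only for illustration.

```python
def qgrams(s, q=2):
    """Return the set of q-grams of s, padded with '#' (assumed padding scheme)."""
    padded = "#" * (q - 1) + s.lower() + "#" * (q - 1)
    return {padded[i:i + q] for i in range(len(padded) - q + 1)}


def jaccard(a, b):
    """Jaccard similarity of the q-gram sets of a and b (assumed measure)."""
    ga, gb = qgrams(a), qgrams(b)
    return len(ga & gb) / len(ga | gb)


def global_threshold_join(left, right, threshold):
    """Nested-loop join: keep every pair whose similarity exceeds one fixed threshold."""
    return [(l, r) for l in left for r in right if jaccard(l, r) > threshold]


if __name__ == "__main__":
    left = ["bozen", "bolzano"]
    right = ["bozen", "bolzanno", "rome"]
    # The same threshold is applied to every pair, regardless of the data.
    for pair in global_threshold_join(left, right, 0.5):
        print(pair)
```

In this sketch the caller has to pick `threshold` up front, which is exactly the configuration step that depends on the data distribution; an adaptive approach as described in the abstract would instead derive a separate threshold for each string.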