Set similarity join on probabilistic data

  • Authors: Xiang Lian; Lei Chen

  • Affiliations: The Hong Kong University of Science and Technology, Kowloon, Hong Kong, China (both authors)

  • Venue: Proceedings of the VLDB Endowment
  • Year: 2010

Abstract

Set similarity join plays an important role in many real-world applications, such as data cleaning, near-duplicate detection, and data integration. In these applications, set data often contain noise and are thus uncertain and imprecise. In this paper, we model such probabilistic set data at two uncertainty levels: the set level and the element level. Based on these models, we investigate the problem of probabilistic set similarity join (PS2J) over two probabilistic set databases under the possible worlds semantics. To process the PS2J operator efficiently, we first reduce the problem by condensing the possible worlds, and then propose effective pruning techniques, including Jaccard distance pruning, probability upper bound pruning, and aggregate pruning, which filter out false alarms among probabilistic set pairs with the help of indexes and the synopses we design. We demonstrate the PS2J processing performance through extensive experiments on both real and synthetic data.
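
To make the possible worlds semantics concrete, the following Python sketch evaluates a join by brute force. It is not the paper's algorithm: it is the kind of exhaustive baseline that condensing possible worlds and pruning are meant to avoid. The abstract does not spell out the data model or the join predicate, so everything here is an assumption made for illustration: a simplified element-level model in which each element exists independently with a given probability, a predicate of the form "the probability that the Jaccard distance is at most gamma is at least alpha", and the hypothetical names `gamma`, `alpha`, `similarity_probability`, and `naive_ps2j`.

```python
from itertools import product


def jaccard_distance(a, b):
    """Jaccard distance between two sets; two empty sets count as distance 0 (assumption)."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


def possible_worlds(pset):
    """Yield (world, probability) pairs for a probabilistic set.

    `pset` maps each element to its existence probability. Elements are
    assumed independent, which is a simplification for this sketch.
    """
    items = sorted(pset.items())
    for mask in product([False, True], repeat=len(items)):
        prob = 1.0
        world = set()
        for (elem, p), present in zip(items, mask):
            prob *= p if present else 1.0 - p
            if present:
                world.add(elem)
        if prob > 0.0:  # skip impossible worlds
            yield frozenset(world), prob


def similarity_probability(r, s, gamma):
    """Probability, over all pairs of possible worlds, that the distance is <= gamma."""
    total = 0.0
    for world_r, prob_r in possible_worlds(r):
        for world_s, prob_s in possible_worlds(s):
            if jaccard_distance(world_r, world_s) <= gamma:
                total += prob_r * prob_s
    return total


def naive_ps2j(R, S, gamma, alpha):
    """Return index pairs (i, j) whose similarity probability is at least alpha."""
    return [
        (i, j)
        for i, r in enumerate(R)
        for j, s in enumerate(S)
        if similarity_probability(r, s, gamma) >= alpha
    ]


if __name__ == "__main__":
    r = {"a": 1.0, "b": 0.5}
    s = {"a": 1.0, "b": 1.0}
    print(similarity_probability(r, s, gamma=0.4))
    print(naive_ps2j([r], [s], gamma=0.4, alpha=0.5))
```

The cost of this baseline grows exponentially with the number of uncertain elements per set, since each pair of sets requires enumerating every combination of worlds. That blow-up is the motivation for discarding candidate pairs early using bounds rather than exact probabilities.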