Sampling dirty data for matching attributes

  • Authors:
  • Henning Köhler; Xiaofang Zhou; Shazia Sadiq; Yanfeng Shu; Kerry Taylor

  • Affiliations:
  • The University of Queensland, Brisbane, Australia; The University of Queensland and NICTA, Brisbane, Australia; The University of Queensland, Brisbane, Australia; CSIRO - Tasmanian ICT Centre, Hobart, Australia; CSIRO - ICT Centre, Canberra, Australia

  • Venue:
  • Proceedings of the 2010 ACM SIGMOD International Conference on Management of Data
  • Year:
  • 2010

Abstract

We investigate the problem of creating and analyzing samples of relational databases to find relationships between string-valued attributes. Our focus is on identifying attribute pairs whose value sets overlap, a precondition for typical joins over such attributes. However, real-world data sets are often 'dirty', especially when data from different sources is integrated. To deal with this issue, we propose new similarity measures between sets of strings that consider not only set-based similarity but also similarity between individual strings. To make the measures effective, we develop efficient algorithms for distributed sample creation and similarity computation. Test results show that, for dirty data, our measures estimate value overlap more accurately than existing sample-based methods, but we also observe a clear tradeoff between accuracy and speed. This motivates a two-stage filtering approach, with both measures operating on the same samples.
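
The abstract does not define the proposed measures, so the following is only an illustrative sketch of the general idea it describes: a purely set-based measure misses overlap when values differ by small typos, whereas a measure that also compares individual strings can still detect it. The function names, the use of `difflib.SequenceMatcher` as the string similarity, and the 0.8 threshold are all assumptions made for this example and are not taken from the paper.

```python
from difflib import SequenceMatcher


def exact_jaccard(a, b):
    """Plain set-based Jaccard similarity; near-miss strings count as different."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def string_sim(s, t):
    """String-level similarity in [0, 1] (difflib ratio; an illustrative choice)."""
    return SequenceMatcher(None, s, t).ratio()


def soft_containment(a, b, threshold=0.8):
    """Fraction of strings in `a` that have a sufficiently similar string in `b`.

    The threshold is a hypothetical parameter chosen for this sketch.
    """
    a, b = set(a), set(b)
    if not a:
        return 0.0
    matched = sum(1 for s in a if any(string_sim(s, t) >= threshold for t in b))
    return matched / len(a)


def soft_overlap(a, b, threshold=0.8):
    """Symmetric variant: the smaller of the two directional containments."""
    return min(soft_containment(a, b, threshold),
               soft_containment(b, a, threshold))


if __name__ == "__main__":
    clean = ["Brisbane", "Hobart", "Canberra"]
    dirty = ["brisbane", "Hobart ", "Canbera"]  # case, whitespace, and typo errors
    print(exact_jaccard(clean, dirty))
    print(soft_overlap(clean, dirty))
```

On the toy samples above, the exact measure finds no shared values, while the string-aware measure treats every value as matched. The string-aware measure compares every pair of strings, so it is considerably slower, which is in line with the accuracy versus speed tradeoff the abstract reports.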