Multiple relationship based deduplication

  • Authors: Pei Li

  • Affiliations: University of Milan, Milan, Italy

  • Venue: Proceedings of the Fourth SIGMOD PhD Workshop on Innovative Database Research
  • Year: 2010

Abstract

Deduplication is the task of finding instances in a given table that refer to the same entity. Several techniques based on pairwise comparison have been presented; they typically partition record pairs into three sets: (i) pairs that definitely match, (ii) pairs that definitely do not match, and (iii) pairs that possibly match. In this paper we present a general approach to domain-independent deduplication problems that uses the knowledge stored in the schema containing the analyzed table. According to the different kinds of relationships, we propose strategies to build knowledge networks and compare them by means of graph-based similarity. The final similarity decision across the different relationship categories is made by exploiting two probabilistic logic models.
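
As an illustration of the three-way pairwise outcome mentioned in the abstract, the following is a minimal sketch and not the paper's method. It assumes records are plain strings, uses a generic string similarity from Python's standard library in place of any schema-based or graph-based measure, and relies on two hypothetical thresholds (`upper` and `lower`) whose values are chosen arbitrarily for the example.

```python
from difflib import SequenceMatcher
from itertools import combinations


def similarity(a, b):
    """Generic string similarity in [0, 1]; an assumed stand-in measure."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def classify_pairs(records, upper=0.9, lower=0.5):
    """Partition all record pairs into match / non-match / possible match.

    `upper` and `lower` are hypothetical thresholds, not values from the paper.
    Returns three lists of index pairs (i, j) with i < j.
    """
    matches, non_matches, possible = [], [], []
    for (i, a), (j, b) in combinations(enumerate(records), 2):
        score = similarity(a, b)
        if score >= upper:
            matches.append((i, j))        # pairs that definitely match
        elif score < lower:
            non_matches.append((i, j))    # pairs that definitely do not match
        else:
            possible.append((i, j))       # pairs that possibly match
    return matches, non_matches, possible


if __name__ == "__main__":
    sample = ["J. Smith, Milan", "J Smith, Milan", "A. Rossi, Rome"]
    for name, pairs in zip(("match", "non-match", "possible"), classify_pairs(sample)):
        print(name, pairs)
```

The third set is the one that pairwise string comparison alone cannot settle; it is where additional evidence, such as the relationship information the abstract refers to, would be brought in.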