Modeling and querying possible repairs in duplicate detection

  • Authors:
  • George Beskales;Mohamed A. Soliman;Ihab F. Ilyas;Shai Ben-David

  • Affiliations:
  • University of Waterloo;University of Waterloo;University of Waterloo;University of Waterloo

  • Venue:
  • Proceedings of the VLDB Endowment
  • Year:
  • 2009

Quantified Score

Hi-index 0.00

Visualization

Abstract

One of the most prominent data quality problems is the existence of duplicate records. Current duplicate elimination procedures usually produce one clean instance (repair) of the input data, by carefully choosing the parameters of the duplicate detection algorithms. Finding the right parameter settings can be hard, and in many cases, perfect settings do not exist. Furthermore, replacing the input dirty data with one possible clean instance may result in unrecoverable errors, for example, identification and merging of possible duplicate records in health care systems. In this paper, we treat duplicate detection procedures as data processing tasks with uncertain outcomes. We concentrate on a family of duplicate detection algorithms that are based on parameterized clustering. We propose a novel uncertainty model that compactly encodes the space of possible repairs corresponding to different parameter settings. We show how to efficiently support relational queries under our model, and to allow new types of queries on the set of possible repairs. We give an experimental study illustrating the scalability and the efficiency of our techniques in different configurations.