Modeling and querying possible repairs in duplicate detection

Authors:
George Beskales;Mohamed A. Soliman;Ihab F. Ilyas;Shai Ben-David
Affiliations:
University of Waterloo;University of Waterloo;University of Waterloo;University of Waterloo
Venue:
Proceedings of the VLDB Endowment
Year:
2009



Abstract

One of the most prominent data quality problems is the existence of duplicate records. Current duplicate elimination procedures usually produce a single clean instance (repair) of the input data by carefully choosing the parameters of the duplicate detection algorithms. Finding the right parameter settings can be hard, and in many cases perfect settings do not exist. Furthermore, replacing the dirty input data with one possible clean instance may introduce unrecoverable errors, for example when records that are only possible duplicates are identified and merged in health care systems. In this paper, we treat duplicate detection procedures as data processing tasks with uncertain outcomes. We concentrate on a family of duplicate detection algorithms that are based on parameterized clustering. We propose a novel uncertainty model that compactly encodes the space of possible repairs corresponding to different parameter settings. We show how to efficiently support relational queries under our model and how to allow new types of queries on the set of possible repairs. We present an experimental study illustrating the scalability and efficiency of our techniques in different configurations.
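
To make the idea of "possible repairs corresponding to different parameter settings" concrete, the following is a minimal sketch and not the model proposed in the paper. It assumes a hypothetical setting in which records are plain numbers, duplicates are found by single-linkage clustering with a distance threshold `tau`, and every threshold in a small candidate list is equally likely. The function names (`cluster`, `possible_repairs`, `duplicate_probability`) and the sample data are illustrative assumptions.

```python
from itertools import combinations


def cluster(values, tau):
    """Single-linkage clustering: link records whose distance is <= tau.

    Returns the repair for this parameter setting as a set of clusters,
    where each cluster is a frozenset of record indices.
    """
    parent = list(range(len(values)))

    def find(i):
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in combinations(range(len(values)), 2):
        if abs(values[i] - values[j]) <= tau:
            parent[find(i)] = find(j)

    groups = {}
    for i in range(len(values)):
        groups.setdefault(find(i), set()).add(i)
    return frozenset(frozenset(g) for g in groups.values())


def possible_repairs(values, taus):
    """Map each distinct cluster to the parameter settings that produce it.

    Storing each cluster once, together with its supporting settings,
    avoids materializing every repair separately.
    """
    support = {}
    for tau in taus:
        for c in cluster(values, tau):
            support.setdefault(c, []).append(tau)
    return support


def duplicate_probability(support, taus, i, j):
    """Fraction of settings in which records i and j share a cluster.

    Assumes (hypothetically) a uniform distribution over the settings.
    """
    hits = {tau for c, ts in support.items() if i in c and j in c for tau in ts}
    return len(hits) / len(taus)


# Hypothetical toy data: four numeric "records" and four candidate thresholds.
values = [1.0, 1.2, 2.0, 5.0]
taus = [0.1, 0.5, 1.0, 4.0]
support = possible_repairs(values, taus)
```

Under these assumptions, a query such as `duplicate_probability(support, taus, 0, 1)` asks how often two records are merged across the candidate settings, which is one example of a query over the set of possible repairs rather than over a single clean instance.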