Indeterministic Handling of Uncertain Decisions in Deduplication

Authors:
Fabian Panse;Maurice van Keulen;Norbert Ritter
Affiliations:
University of Hamburg, Germany;University of Twente, the Netherlands;University of Hamburg, Germany
Venue:
Journal of Data and Information Quality (JDIQ) - Special Issue on Entity Resolution
Year:
2013

Abstract

In current research and practice, deduplication is usually treated as a deterministic process in which database tuples are either declared to be duplicates or not. In ambiguous situations, however, it is often not clear-cut which tuples represent the same real-world entity. Deterministic approaches may therefore ignore many realistic possibilities, which in turn can lead to false decisions. In this article, we present an indeterministic approach to deduplication that uses a probabilistic target model and includes techniques for a proper probabilistic interpretation of similarity matching results. Thus, instead of committing to one of the most likely situations, all realistic situations are modeled in the resulting data. This approach minimizes the negative impact of false decisions. Moreover, the deduplication process becomes almost fully automatic, and human effort can be largely reduced. To increase applicability, we introduce several semi-indeterministic methods that heuristically reduce the set of indeterministically handled decisions in several meaningful ways. We also describe a fully indeterministic method for theoretical and presentational reasons.
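
The general idea can be illustrated with a small sketch. It is not the method of the article: the two thresholds, the linear mapping from similarity to probability, the independence of pair decisions, and the names `match_probability` and `possible_worlds` are all assumptions made only for illustration. Pairs with a clearly low or clearly high similarity are decided deterministically, and only the pairs in between are kept as uncertain decisions, so that every combination of those decisions is retained with a weight.

```python
from itertools import product

# Hypothetical thresholds; the abstract does not specify any values.
T_LOW, T_HIGH = 0.4, 0.8


def match_probability(sim: float) -> float:
    """Map a similarity score to a duplicate probability.

    Assumption: a linear ramp between the two thresholds.
    """
    if sim <= T_LOW:
        return 0.0
    if sim >= T_HIGH:
        return 1.0
    return (sim - T_LOW) / (T_HIGH - T_LOW)


def possible_worlds(pair_sims: dict) -> list:
    """Enumerate every combination of decisions for the uncertain pairs.

    Assumption: pair decisions are independent, so transitivity of the
    duplicate relation is ignored in this sketch.
    """
    probs = {pair: match_probability(s) for pair, s in pair_sims.items()}
    # Pairs with probability 0 or 1 are decided deterministically.
    certain = {p: q == 1.0 for p, q in probs.items() if q in (0.0, 1.0)}
    uncertain = [p for p, q in probs.items() if 0.0 < q < 1.0]

    worlds = []
    for choice in product([True, False], repeat=len(uncertain)):
        decisions = dict(certain)
        weight = 1.0
        for pair, is_dup in zip(uncertain, choice):
            decisions[pair] = is_dup
            weight *= probs[pair] if is_dup else 1.0 - probs[pair]
        worlds.append((decisions, weight))
    return worlds


# Made-up similarity scores for three tuple pairs.
sims = {("t1", "t2"): 0.9, ("t1", "t3"): 0.6, ("t2", "t4"): 0.1}
worlds = possible_worlds(sims)
for decisions, weight in worlds:
    print(decisions, round(weight, 3))
```

In this toy input only the pair `("t1", "t3")` falls between the thresholds, so two alternatives are kept instead of one forced decision. Narrowing the gap between the thresholds reduces the number of uncertain pairs, which loosely mirrors the semi-indeterministic idea of limiting how many decisions are handled indeterministically.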