In current research and practice, deduplication is usually treated as a deterministic process in which database tuples are either declared to be duplicates or not. In ambiguous situations, however, it is often far from clear-cut which tuples represent the same real-world entity. Deterministic approaches ignore many realistic possibilities, which in turn can lead to false decisions. In this article, we present an indeterministic approach to deduplication that uses a probabilistic target model, including techniques for a proper probabilistic interpretation of similarity-matching results. Thus, instead of committing to one of the most likely situations, all realistic situations are modeled in the resulting data. This approach minimizes the negative impact of false decisions. Moreover, the deduplication process becomes almost fully automatic, largely reducing human effort. To increase applicability, we introduce several semi-indeterministic methods that heuristically reduce the set of indeterministically handled decisions in meaningful ways. We also describe a full-indeterministic method for theoretical and presentational purposes.
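To illustrate the core idea, the following is a minimal sketch (not the authors' actual model): for an ambiguous record pair, a deterministic approach would apply a threshold and commit to one outcome, whereas an indeterministic approach retains both possible outcomes together with their probabilities. The toy `match_probability` function and the direct similarity-to-probability mapping are illustrative assumptions, not part of the paper.

```python
# Hypothetical sketch of indeterministic deduplication: keep every
# plausible outcome of a duplicate decision ("possible world") with
# a probability, instead of committing to one deterministic merge.

def match_probability(a, b):
    """Toy similarity: fraction of fields with equal values,
    interpreted directly as a match probability (an assumption;
    the paper uses a proper probabilistic interpretation of
    similarity-matching results)."""
    fields = set(a) | set(b)
    same = sum(1 for f in fields if a.get(f) == b.get(f))
    return same / len(fields)

def possible_worlds(r1, r2):
    """For one ambiguous pair, return both outcomes with their
    probabilities rather than thresholding to a single decision."""
    p = match_probability(r1, r2)
    return [("duplicates", p), ("distinct", 1 - p)]

r1 = {"name": "J. Smith", "city": "Berlin", "zip": "10115"}
r2 = {"name": "John Smith", "city": "Berlin", "zip": "10115"}
for outcome, prob in possible_worlds(r1, r2):
    print(outcome, round(prob, 2))
```

A semi-indeterministic variant, as described in the abstract, would heuristically prune this set, e.g. committing deterministically when the probability is near 0 or 1 and keeping both worlds only in genuinely ambiguous cases.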