A precise blocking method for record linkage

Authors:
Patrick Lehti;Peter Fankhauser
Affiliations:
Fraunhofer IPSI, Darmstadt, Germany;Fraunhofer IPSI, Darmstadt, Germany
Venue:
DaWaK'05 Proceedings of the 7th international conference on Data Warehousing and Knowledge Discovery
Year:
2005

Citing 4
Cited 3

Efficient clustering of high-dimensional data sets with application to reference matching

Proceedings of the sixth ACM SIGKDD international conference on Knowledge discovery and data mining
Modern Information Retrieval

Modern Information Retrieval
Real-world Data is Dirty: Data Cleansing and The Merge/Purge Problem

Data Mining and Knowledge Discovery
A hierarchical graphical model for record linkage

UAI '04 Proceedings of the 20th conference on Uncertainty in artificial intelligence

XML duplicate detection using sorted neighborhoods

EDBT'06 Proceedings of the 10th international conference on Advances in Database Technology
Probabilistic iterative duplicate detection

OTM'05 Proceedings of the 2005 OTM Confederated international conference on On the Move to Meaningful Internet Systems: CoopIS, COA, and ODBASE - Volume Part II
Unsupervised duplicate detection using sample non-duplicates

Journal on Data Semantics VII

Quantified Score

Hi-index	0.00

Visualization

Abstract

Identifying approximately duplicate records between databases requires the costly computation of distances between their attributes. Thus duplicate detection is usually performed in two phases, an efficient blocking phase that determines few potential candidate duplicates based on simple criteria, followed by a second phase performing an in-depth comparison of the candidate duplicates. This paper introduces and evaluates a precise and efficient approach for the blocking phase, which requires only standard indices, but performs as well as other approaches based on special purpose indices, and outperforms other approaches based on standard indices. The key idea of the approach is to use a comparison window with a size that depends dynamically on a maximum distance, rather than using a window with fixed size.