Efficient duplicate record detection based on similarity estimation

Authors:
Mohan Li;Hongzhi Wang;Jianzhong Li;Hong Gao
Affiliations:
Harbin Institute of Technology, Harbin;Harbin Institute of Technology, Harbin;Harbin Institute of Technology, Harbin;Harbin Institute of Technology, Harbin
Venue:
WAIM'10 Proceedings of the 11th international conference on Web-age information management
Year:
2010

Abstract

In information integration systems, duplicate records cause problems in data processing and analysis. To represent the similarity between two records from different data sources with different schemas, an optimal bipartite graph matching is computed over their attributes, and the similarity is measured as the weight of that matching. However, this intuitive method has two shortcomings. In terms of efficiency, it must compare all records pairwise. In terms of effectiveness, a strict condition for judging duplicate records results in low recall. To make the method work in practice, this paper presents an efficient method based on similarity estimation. The basic idea is to estimate the range of the similarity between two records in O(1) time and to determine whether they are duplicates according to that estimate. Theoretical analysis and experimental results show that the method is effective and efficient.
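
The two ideas in the abstract, matching-based record similarity and a decision made from an estimated similarity range, can be illustrated with a minimal Python sketch. It is not the authors' implementation. The attribute-level measure (`difflib.SequenceMatcher`), the normalization by the larger attribute count, the brute-force search over assignments, and the names `attribute_similarity`, `record_similarity`, and `classify` are all assumptions made for illustration. The abstract does not say how the O(1) range estimate is computed, so `classify` simply takes the lower and upper bounds as inputs.

```python
from difflib import SequenceMatcher
from itertools import permutations
from typing import Sequence


def attribute_similarity(a: str, b: str) -> float:
    """Assumed attribute-level similarity in [0, 1]; the paper's measure may differ."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def record_similarity(r1: Sequence[str], r2: Sequence[str]) -> float:
    """Weight of a maximum-weight bipartite matching between attribute values.

    The weight is divided by the larger attribute count (an assumed
    normalization). The search is brute force over all assignments, so it
    only suits records with a handful of attributes.
    """
    if not r1 or not r2:
        return 0.0
    small, large = (r1, r2) if len(r1) <= len(r2) else (r2, r1)
    best = 0.0
    # Each permutation assigns every attribute of `small` to a distinct attribute of `large`.
    for perm in permutations(range(len(large)), len(small)):
        weight = sum(attribute_similarity(small[i], large[j]) for i, j in enumerate(perm))
        best = max(best, weight)
    return best / len(large)


def classify(lower: float, upper: float, threshold: float) -> str:
    """Decide from an estimated similarity range [lower, upper].

    The bounds are assumed to come from some cheap estimator; only pairs
    whose range straddles the threshold need the exact matching.
    """
    if lower >= threshold:
        return "duplicate"
    if upper < threshold:
        return "distinct"
    return "verify"
```

With this sketch, two records holding the same values in a different attribute order, such as `["John Smith", "Harbin", "1980"]` and `["1980", "john smith", "harbin"]`, receive the maximum similarity, because the matching pairs each value with its counterpart regardless of schema order. For records with many attributes, a polynomial-time assignment algorithm such as the Hungarian method would replace the brute-force loop.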