In information integration systems, duplicate records cause problems in data processing and analysis. To measure the similarity between two records drawn from data sources with different schemas, an optimal bipartite matching can be computed over their attributes, and the similarity is taken as the weight of that matching. This intuitive method, however, has two shortcomings. In efficiency, it requires comparing all records pairwise. In effectiveness, a strict condition for judging duplicate records yields a low recall rate. To make the method practical, this paper presents an efficient method based on similarity estimation: the basic idea is to estimate the range of the similarity between two records in O(1) time and to decide whether they are duplicates according to that estimate. Theoretical analysis and experimental results show that the method is both effective and efficient.
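To illustrate the baseline the abstract refers to, the following is a minimal sketch of record similarity via optimal bipartite matching over attributes. It is not the paper's implementation: the string measure (`difflib.SequenceMatcher`), the brute-force matching, and the normalization by the smaller attribute count are all illustrative assumptions.

```python
from difflib import SequenceMatcher
from itertools import permutations


def attr_sim(a: str, b: str) -> float:
    """Similarity of two attribute values in [0, 1].

    SequenceMatcher is an illustrative choice; the paper does not fix
    a particular attribute-level measure here.
    """
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def record_similarity(rec1: list[str], rec2: list[str]) -> float:
    """Weight of an optimal bipartite matching between the attributes
    of two records, found by brute force over permutations.

    Only workable for records with a handful of attributes; a real
    implementation would use the Hungarian algorithm (O(n^3)).
    """
    short, long_ = (rec1, rec2) if len(rec1) <= len(rec2) else (rec2, rec1)
    best = 0.0
    for perm in permutations(range(len(long_)), len(short)):
        weight = sum(attr_sim(short[i], long_[j]) for i, j in enumerate(perm))
        best = max(best, weight)
    # Normalize by the smaller attribute count so the score lies in [0, 1]
    # (an assumed convention, not taken from the paper).
    return best / len(short)


# Two records of the same entity under different schemas / attribute orders.
r1 = ["John Smith", "New York", "Engineer"]
r2 = ["Smith, John", "Engineer", "NYC"]
print(record_similarity(r1, r2))
```

Because every record pair requires such a matching computation, the pairwise comparison the abstract criticizes is expensive; the paper's contribution is to bound this similarity by an O(1) estimate before (or instead of) computing the exact matching.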