Mining the Web for Synonyms: PMI-IR versus LSA on TOEFL
EMCL '01 Proceedings of the 12th European Conference on Machine Learning
Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data
ICML '01 Proceedings of the Eighteenth International Conference on Machine Learning
Interactive deduplication using active learning
Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining
Learning domain-independent string transformation weights for high accuracy object identification
Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining
Learning to match and cluster large high-dimensional data sets for data integration
Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining
Adaptive duplicate detection using learnable string similarity measures
Proceedings of the ninth ACM SIGKDD international conference on Knowledge discovery and data mining
Syntax-based alignment of multiple translations: extracting paraphrases and generating new sentences
NAACL '03 Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology - Volume 1
A Heterogeneous Field Matching Method for Record Linkage
ICDM '05 Proceedings of the Fifth IEEE International Conference on Data Mining
Generating query substitutions
Proceedings of the 15th international conference on World Wide Web
Duplicate Record Detection: A Survey
IEEE Transactions on Knowledge and Data Engineering
Do not crawl in the dust: different urls with similar text
Proceedings of the 16th international conference on World Wide Web
Efficient similarity joins for near duplicate detection
Proceedings of the 17th international conference on World Wide Web
Incorporating string transformations in record matching
Proceedings of the 2008 ACM SIGMOD international conference on Management of data
Optimizing relevance and revenue in ad search: a query substitution approach
Proceedings of the 31st annual international ACM SIGIR conference on Research and development in information retrieval
De-duping URLs via rewrite rules
Proceedings of the 14th ACM SIGKDD international conference on Knowledge discovery and data mining
Transformation-based Framework for Record Matching
ICDE '08 Proceedings of the 2008 IEEE 24th International Conference on Data Engineering
Efficient approximate entity extraction with edit distance constraints
Proceedings of the 2009 ACM SIGMOD International Conference on Management of data
Robust similarity measures for named entities matching
COLING '08 Proceedings of the 22nd International Conference on Computational Linguistics - Volume 1
Learning string transformations from examples
Proceedings of the VLDB Endowment
On active learning of record matching packages
Proceedings of the 2010 ACM SIGMOD International Conference on Management of data
Hi-index | 0.00 |
Record linkage identifies multiple records referring to the same entity even if they are not bit-wise identical. It is thus an essential technology for data integration and data cleansing. Existing record linkage approaches are mainly relying on similarity functions based on the surface forms of the records, and hence are not able to identify complex coreference records. This seriously limits the effectiveness of existing approaches. In this work, we propose an automatic method to extract top-k high quality transformation rules given a set of possibly coreferent record pairs. We propose an effective algorithm that performs careful local analyses for each record pair and generates candidate rules; the algorithm finally chooses top-k rules based on a scoring function. We have conducted extensive experiments on real datasets, and our proposed algorithm has substantial advantage over the previous algorithm in both effectiveness and efficiency.