Learning top-k transformation rules

Authors:
Sunanda Patro;Wei Wang
Affiliations:
University of New South Wales;University of New South Wales
Venue:
DEXA'11 Proceedings of the 22nd international conference on Database and expert systems applications - Volume Part I
Year:
2011

Abstract

Record linkage identifies multiple records that refer to the same entity even when they are not bit-wise identical. It is thus an essential technology for data integration and data cleansing. Existing record linkage approaches rely mainly on similarity functions computed over the surface forms of the records, and hence cannot identify complex coreferent records. This seriously limits their effectiveness. In this work, we propose an automatic method that extracts the top-k high-quality transformation rules from a given set of possibly coreferent record pairs. Our algorithm performs careful local analysis of each record pair to generate candidate rules, and then selects the top-k rules according to a scoring function. Extensive experiments on real datasets show that the proposed algorithm has a substantial advantage over the previous algorithm in both effectiveness and efficiency.
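
The pipeline outlined in the abstract (per-pair local analysis, candidate rule generation, top-k selection by score) can be illustrated with a small sketch. This is not the authors' algorithm: the abstract does not specify the local analysis or the scoring function, so the sketch assumes a token-level diff (Python's difflib) as the local analysis and the number of record pairs supporting a rule as its score. The names candidate_rules and top_k_rules are hypothetical.

```python
from collections import Counter
from difflib import SequenceMatcher


def candidate_rules(left, right):
    """Yield (lhs, rhs) token-span rewrites found by diffing one record pair.

    Assumption: a token-level diff stands in for the paper's local analysis.
    """
    a, b = left.lower().split(), right.lower().split()
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        # Only substituted spans become rules; pure insertions and
        # deletions are ignored in this simplified sketch.
        if tag == "replace":
            yield " ".join(a[i1:i2]), " ".join(b[j1:j2])


def top_k_rules(pairs, k):
    """Return the k candidate rules supported by the most record pairs.

    Assumption: support count stands in for the paper's scoring function.
    """
    support = Counter()
    for left, right in pairs:
        # A set ensures each rule is counted at most once per pair.
        support.update(set(candidate_rules(left, right)))
    return support.most_common(k)
```

Given pairs such as ("proc of vldb", "proceedings of vldb") and ("proc of sigmod", "proceedings of sigmod"), the sketch proposes the rule proc -> proceedings with a support of 2. A real scoring function would also need to penalize rules that fire on non-coreferent pairs, which a plain frequency count does not do.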