Pass-join: a partition-based method for similarity joins
Proceedings of the VLDB Endowment
Can we beat the prefix filtering?: an adaptive framework for similarity join and search
SIGMOD '12 Proceedings of the 2012 ACM SIGMOD International Conference on Management of Data
ISWC'12 Proceedings of the 11th international conference on The Semantic Web - Volume Part II
Integrating feature analysis and background knowledge to recommend similarity functions
WISE'12 Proceedings of the 13th international conference on Web Information Systems Engineering
Finding email correspondents in online social networks
World Wide Web
Proceedings of the Joint EDBT/ICDT 2013 Workshops
Accuracy vs. Speed: Scalable Entity Coreference on the Semantic Web with On-the-Fly Pruning
WI-IAT '12 Proceedings of the The 2012 IEEE/WIC/ACM International Joint Conferences on Web Intelligence and Intelligent Agent Technology - Volume 01
Tuning large scale deduplication with reduced effort
Proceedings of the 25th International Conference on Scientific and Statistical Database Management
A partition-based method for string similarity joins with edit-distance constraints
ACM Transactions on Database Systems (TODS)
Asymmetric signature schemes for efficient exact edit similarity query processing
ACM Transactions on Database Systems (TODS)
A human-machine method for web table understanding
WAIM'13 Proceedings of the 14th international conference on Web-Age Information Management
Extending string similarity join to tolerant fuzzy token matching
ACM Transactions on Database Systems (TODS)
Scalable column concept determination for web tables using large knowledge bases
Proceedings of the VLDB Endowment
Hi-index | 0.00 |
String similarity join that finds similar string pairs between two string sets is an essential operation in many applications, and has attracted significant attention recently in the database community. A significant challenge in similarity join is to implement an effective fuzzy match operation to find all similar string pairs which may not match exactly. In this paper, we propose a new similarity metrics, called "fuzzy token matching based similarity", which extends token-based similarity functions (e.g., Jaccard similarity and Cosine similarity) by allowing fuzzy match between two tokens. We study the problem of similarity join using this new similarity metrics and present a signature-based method to address this problem. We propose new signature schemes and develop effective pruning techniques to improve the performance. Experimental results show that our approach achieves high efficiency and result quality, and significantly outperforms state-of-the-art methods.