The merge/purge problem for large databases
SIGMOD '95 Proceedings of the 1995 ACM SIGMOD international conference on Management of data
Efficient clustering of high-dimensional data sets with application to reference matching
Proceedings of the sixth ACM SIGKDD international conference on Knowledge discovery and data mining
Automating the approximate record-matching process
Information Sciences—Informatics and Computer Science: An International Journal
Data integration using similarity joins and a word-based information representation language
ACM Transactions on Information Systems (TOIS)
Automated name authority control
Proceedings of the 1st ACM/IEEE-CS joint conference on Digital libraries
Real-world Data is Dirty: Data Cleansing and The Merge/Purge Problem
Data Mining and Knowledge Discovery
The DBLP Computer Science Bibliography: Evolution, Research Issues, Perspectives
SPIRE 2002 Proceedings of the 9th International Symposium on String Processing and Information Retrieval
Interactive deduplication using active learning
Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining
Robust and efficient fuzzy match for online data cleaning
Proceedings of the 2003 ACM SIGMOD international conference on Management of data
A program for aligning sentences in bilingual corpora
Computational Linguistics - Special issue on using large corpora: I
Iterative record linkage for cleaning and integration
Proceedings of the 9th ACM SIGMOD workshop on Research issues in data mining and knowledge discovery
Disambiguating Web appearances of people in a social network
WWW '05 Proceedings of the 14th international conference on World Wide Web
Name disambiguation in author citations using a K-way spectral clustering method
Proceedings of the 5th ACM/IEEE-CS joint conference on Digital libraries
Comparative study of name disambiguation problem using a scalable blocking-based framework
Proceedings of the 5th ACM/IEEE-CS joint conference on Digital libraries
Reference reconciliation in complex information spaces
Proceedings of the 2005 ACM SIGMOD international conference on Management of data
Effective and scalable solutions for mixed and split citation problems in digital libraries
Proceedings of the 2nd international workshop on Information quality in information systems
Adaptive Name Matching in Information Integration
IEEE Intelligent Systems
Duplicate Record Detection: A Survey
IEEE Transactions on Knowledge and Data Engineering
Eliminating fuzzy duplicates in data warehouses
VLDB '02 Proceedings of the 28th international conference on Very Large Data Bases
Communications of the ACM
Constraint-based entity matching
AAAI'05 Proceedings of the 20th national conference on Artificial intelligence - Volume 2
IEEE Transactions on Information Theory
An approach to conversational agent design using semantic sentence similarity
Applied Intelligence
Hi-index | 0.00 |
To see if two given strings are matched, various string similarity metrics have been employed and these string similarities can be categorized into three classes: (a) Edit-distance-based similarities, (b) Token-based similarities, and (c) Hybrid similarities. In essence, since different types of string similarities have different pros and cons in measuring the similarity between two strings, string similarity metrics in each class are likely to work well for particular data sets. Toward this problem, we propose a novel Meta Similarity that both (i) outperforms the existing similarity metrics and (ii) is the least affected by a variety of data sets. Our claim is empirically validated through extensive experimental tests--our proposal shows an improvement to the largest 20% average recall, compared to the best case of the existing similarity metrics and our method is the most stable, showing from 0.95 to 1.0 average recall range in all the data sets.