Meta similarity

Authors: Byung-Won On; Ingyu Lee
Affiliations: Singapore Management University, Singapore, Singapore; Troy University, Troy, USA
Venue: Applied Intelligence
Year: 2011

Citing 23

The merge/purge problem for large databases
SIGMOD '95 Proceedings of the 1995 ACM SIGMOD international conference on Management of data

Efficient clustering of high-dimensional data sets with application to reference matching
Proceedings of the sixth ACM SIGKDD international conference on Knowledge discovery and data mining

Automating the approximate record-matching process
Information Sciences—Informatics and Computer Science: An International Journal

Data integration using similarity joins and a word-based information representation language
ACM Transactions on Information Systems (TOIS)

Automated name authority control
Proceedings of the 1st ACM/IEEE-CS joint conference on Digital libraries

Real-world Data is Dirty: Data Cleansing and The Merge/Purge Problem
Data Mining and Knowledge Discovery

Digital Libraries and Autonomous Citation Indexing
Computer

The DBLP Computer Science Bibliography: Evolution, Research Issues, Perspectives
SPIRE 2002 Proceedings of the 9th International Symposium on String Processing and Information Retrieval

Interactive deduplication using active learning
Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining

Robust and efficient fuzzy match for online data cleaning
Proceedings of the 2003 ACM SIGMOD international conference on Management of data

A program for aligning sentences in bilingual corpora
Computational Linguistics - Special issue on using large corpora: I

Iterative record linkage for cleaning and integration
Proceedings of the 9th ACM SIGMOD workshop on Research issues in data mining and knowledge discovery

Disambiguating Web appearances of people in a social network
WWW '05 Proceedings of the 14th international conference on World Wide Web

Name disambiguation in author citations using a K-way spectral clustering method
Proceedings of the 5th ACM/IEEE-CS joint conference on Digital libraries

Comparative study of name disambiguation problem using a scalable blocking-based framework
Proceedings of the 5th ACM/IEEE-CS joint conference on Digital libraries

Reference reconciliation in complex information spaces
Proceedings of the 2005 ACM SIGMOD international conference on Management of data

Effective and scalable solutions for mixed and split citation problems in digital libraries
Proceedings of the 2nd international workshop on Information quality in information systems

Adaptive Name Matching in Information Integration
IEEE Intelligent Systems

Duplicate Record Detection: A Survey
IEEE Transactions on Knowledge and Data Engineering

Eliminating fuzzy duplicates in data warehouses
VLDB '02 Proceedings of the 28th international conference on Very Large Data Bases

Are your citations clean?
Communications of the ACM

Constraint-based entity matching
AAAI'05 Proceedings of the 20th national conference on Artificial intelligence - Volume 2

The similarity metric
IEEE Transactions on Information Theory

Cited 3

Pattern matching with wildcards and gap-length constraints based on a centrality-degree graph
Applied Intelligence

An approach to conversational agent design using semantic sentence similarity
Applied Intelligence

Missing data analyses: a hybrid multiple imputation algorithm using Gray System Theory and entropy based on clustering
Applied Intelligence

Quantified Score

Hi-index	0.00

Abstract

Various string similarity metrics have been used to determine whether two given strings match. These metrics fall into three classes: (a) edit-distance-based similarities, (b) token-based similarities, and (c) hybrid similarities. Because each class has different strengths and weaknesses in measuring the similarity between two strings, the metrics in each class tend to work well only for particular data sets. To address this problem, we propose a novel Meta Similarity that both (i) outperforms the existing similarity metrics and (ii) is the least affected by differences among data sets. Our claim is empirically validated through extensive experiments: our proposal improves average recall by up to 20% compared to the best case of the existing similarity metrics, and our method is the most stable, with average recall ranging from 0.95 to 1.0 across all the data sets.
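
To make the three classes named in the abstract concrete, the following is a minimal sketch with one representative metric per class. The choice of representatives (normalized Levenshtein, token-set Jaccard, and a Monge-Elkan-style hybrid) and all function names are assumptions made for illustration. The abstract does not say which metrics were evaluated, and this sketch does not implement the proposed Meta Similarity, whose construction the abstract does not describe.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character edits turning a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def edit_similarity(a: str, b: str) -> float:
    """(a) Edit-distance-based: Levenshtein distance scaled to [0, 1]."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(a, b) / longest


def jaccard_similarity(a: str, b: str) -> float:
    """(b) Token-based: overlap of whitespace-separated token sets."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)


def monge_elkan_similarity(a: str, b: str) -> float:
    """(c) Hybrid: for each token of a, take its best character-level
    match among the tokens of b, then average (Monge-Elkan style).
    Note that this score is not symmetric in a and b."""
    ta, tb = a.lower().split(), b.lower().split()
    if not ta or not tb:
        return 1.0 if ta == tb else 0.0
    return sum(max(edit_similarity(x, y) for y in tb) for x in ta) / len(ta)


if __name__ == "__main__":
    pair = ("jon smith", "smith john")
    print("edit   :", edit_similarity(*pair))
    print("token  :", jaccard_similarity(*pair))
    print("hybrid :", monge_elkan_similarity(*pair))
```

On a pair such as "jon smith" and "smith john", the three scores differ noticeably: the edit-based score is penalized by the reordered words, the token-based score by the one-letter spelling difference, and the hybrid tolerates both. This is the kind of data-set-dependent behavior the abstract refers to.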