Probabilistic correlation-based similarity measure of unstructured records

Authors:
Shaoxu Song;Lei Chen
Affiliations:
Hong Kong University of Science and Technology, Hong Kong, Hong Kong;Hong Kong University of Science and Technology, Hong Kong, Hong Kong
Venue:
Proceedings of the sixteenth ACM conference on Conference on information and knowledge management
Year:
2007

Citing 7
Cited 0

Automatic text processing: the transformation, analysis, and retrieval of information by computer

Automatic text processing: the transformation, analysis, and retrieval of information by computer
Approximate string-matching with q-grams and maximal matches

Theoretical Computer Science - Selected papers of the Combinatorial Pattern Matching School
Integration of heterogeneous databases without common domains using queries based on textual similarity

SIGMOD '98 Proceedings of the 1998 ACM SIGMOD international conference on Management of data
Efficient clustering of high-dimensional data sets with application to reference matching

Proceedings of the sixth ACM SIGKDD international conference on Knowledge discovery and data mining
A guided tour to approximate string matching

ACM Computing Surveys (CSUR)
Information Retrieval

Information Retrieval
Text joins in an RDBMS for web data integration

WWW '03 Proceedings of the 12th international conference on World Wide Web

Quantified Score

Hi-index	0.00

Visualization

Abstract

Computing the similarity between unstructured records is a fundamental function in multiple applications. Approximate string matching and full text retrieval techniques do not show the best performance when applied directly, since the information are limited in unstructured records of short record length. In this paper, we propose a novel probabilistic correlation-based similarity measure. Rather than simply conducting the exact matching tokens of two records, our similarity evaluation enriches the information of records by considering the correlations of tokens. We define the probabilistic correlation between tokens as the probability that these tokens appear in the same records. Then we compute the weight of tokens and discover the correlations of records based on the probabilistic correlations of tokens. Finally, we present extensive experimental results to demonstrate the effectiveness of our approach.