SC spectra: a linear-time soft cardinality approximation for text comparison

Authors:
Sergio Jiménez Vargas; Alexander Gelbukh
Affiliations:
Intelligent Systems Research Laboratory (LISI), Systems and Industrial Engineering Department, National University of Colombia, Bogota, Colombia; Center for Computing Research (CIC), National Polytechnic Institute (IPN), Mexico City, Mexico
Venue:
MICAI'11 Proceedings of the 10th international conference on Artificial Intelligence: advances in Soft Computing - Volume Part II
Year:
2011

Abstract

Soft cardinality (SC) is a softened version of the classical cardinality of set theory. Because computing it exactly has a prohibitive (exponential) cost, an approximation that is quadratic in the number of terms in the text was proposed previously. SC spectra is a new linear-time approximation method for text strings, which divides each string into consecutive substrings (i.e., q-grams) of different sizes. Combining SC with resemblance coefficients yields a family of similarity functions for text comparison. These similarity measures were previously applied to an entity resolution (name matching) problem, where they outperformed the SoftTFIDF measure. The SC spectra method improves on those results, running in less time and achieving better performance. This allows the new method to be used with relatively large documents, such as those included in classic information retrieval collections. The SC spectra method outperformed the SoftTFIDF and cosine tf-idf baselines with an approach that requires no term weighting.
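
The idea of splitting a string into q-grams of several sizes and plugging the resulting cardinalities into a resemblance coefficient can be illustrated with a minimal Python sketch. This is a simplified illustration under stated assumptions, not the formula from the paper: it uses the classical cardinality of each q-gram set, summed over a range of q, and the Jaccard coefficient as the resemblance coefficient. The function names and the default range q = 2..3 are hypothetical choices made for this example.

```python
def qgrams(text, q):
    """Return the set of consecutive substrings of length q in text."""
    return {text[i:i + q] for i in range(len(text) - q + 1)}


def spectrum_cardinality(text, q_start, q_end):
    """Sum the number of distinct q-grams for every q in [q_start, q_end].

    Assumption: classical set cardinality per q, summed over the range.
    For a fixed range of q this takes time linear in len(text).
    """
    return sum(len(qgrams(text, q)) for q in range(q_start, q_end + 1))


def spectrum_union_cardinality(a, b, q_start, q_end):
    """Same sum as above, computed over the union of both q-gram sets."""
    return sum(
        len(qgrams(a, q) | qgrams(b, q)) for q in range(q_start, q_end + 1)
    )


def jaccard_spectrum(a, b, q_start=2, q_end=3):
    """Jaccard resemblance coefficient built from spectrum cardinalities."""
    card_a = spectrum_cardinality(a, q_start, q_end)
    card_b = spectrum_cardinality(b, q_start, q_end)
    card_union = spectrum_union_cardinality(a, b, q_start, q_end)
    if card_union == 0:
        # Both strings are too short to produce any q-gram.
        return 1.0 if a == b else 0.0
    # Intersection cardinality by inclusion-exclusion.
    card_intersection = card_a + card_b - card_union
    return card_intersection / card_union


if __name__ == "__main__":
    print(jaccard_spectrum("Sergio Jimenez", "Sergio Jiminez"))
```

Other resemblance coefficients (Dice, overlap, cosine) could be built from the same three cardinalities, which is how a family of similarity functions arises from a single cardinality function.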