Soft cardinality: a parameterized similarity function for text comparison

Authors:
Sergio Jimenez;Claudia Becerra;Alexander Gelbukh
Affiliations:
Universidad Nacional de Colombia, Bogota, Ciudad Universitaria, edificio, oficina;Universidad Nacional de Colombia, Bogota;CIC-IPN Av. Juan Dios Bátiz, Av. Mendizábal, Col. Nueva Industrial Vallejo, DF, México
Venue:
SemEval '12 Proceedings of the First Joint Conference on Lexical and Computational Semantics - Volume 1: Proceedings of the main conference and the shared task, and Volume 2: Proceedings of the Sixth International Workshop on Semantic Evaluation
Year:
2012

Citing 10
Cited 1

Learning String-Edit Distance

IEEE Transactions on Pattern Analysis and Machine Intelligence
Algorithms for the Longest Common Subsequence Problem

Journal of the ACM (JACM)
A vector space model for automatic indexing

Communications of the ACM
Adaptive duplicate detection using learnable string similarity measures

Proceedings of the ninth ACM SIGKDD international conference on Knowledge discovery and data mining
Generalized Mongue-Elkan Method for Approximate Text String Comparison

CICLing '09 Proceedings of the 10th International Conference on Computational Linguistics and Intelligent Text Processing
WordNet::Similarity: measuring the relatedness of concepts

HLT-NAACL--Demonstrations '04 Demonstration Papers at HLT-NAACL 2004
Measuring the semantic similarity of texts

EMSEE '05 Proceedings of the ACL Workshop on Empirical Modeling of Semantic Equivalence and Entailment
Text comparison using soft cardinality

SPIRE'10 Proceedings of the 17th international conference on String processing and information retrieval
SC spectra: a linear-time soft cardinality approximation for text comparison

MICAI'11 Proceedings of the 10th international conference on Artificial Intelligence: advances in Soft Computing - Volume Part II
SemEval-2012 task 6: a pilot on semantic textual similarity

SemEval '12 Proceedings of the First Joint Conference on Lexical and Computational Semantics - Volume 1: Proceedings of the main conference and the shared task, and Volume 2: Proceedings of the Sixth International Workshop on Semantic Evaluation

A new benchmark dataset with production methodology for short text semantic similarity algorithms

ACM Transactions on Speech and Language Processing (TSLP)

Quantified Score

Hi-index	0.00

Visualization

Abstract

We present an approach for the construction of text similarity functions using a parameterized resemblance coefficient in combination with a softened cardinality function called soft cardinality. Our approach provides a consistent and recursive model, varying levels of granularity from sentences to characters. Therefore, our model was used to compare sentences divided into words, and in turn, words divided into q-grams of characters. Experimentally, we observed that a performance correlation function in a space defined by all parameters was relatively smooth and had a single maximum achievable by "hill climbing." Our approach used only surface text information, a stop-word remover, and a stemmer to tackle the semantic text similarity task 6 at SEMEVAL 2012. The proposed method ranked 3rd (average), 5th (normalized correlation), and 15th (aggregated correlation) among 89 systems submitted by 31 teams.