New Semantic Similarity Based Model for Text Clustering Using Extended Gloss Overlaps

Authors:
Walaa K. Gad;Mohamed S. Kamel
Affiliations:
Department of Electrical and Computer Engineering, University of Waterloo, Waterloo, Canada N2L 3G1;Department of Electrical and Computer Engineering, University of Waterloo, Waterloo, Canada N2L 3G1
Venue:
MLDM '09 Proceedings of the 6th International Conference on Machine Learning and Data Mining in Pattern Recognition
Year:
2009

Citing 15
Cited 1

Automatic sense disambiguation using machine readable dictionaries: how to tell a pine cone from an ice cream cone

SIGDOC '86 Proceedings of the 5th annual international conference on Systems documentation
Introduction to Modern Information Retrieval

Introduction to Modern Information Retrieval
An Information-Theoretic Definition of Similarity

ICML '98 Proceedings of the Fifteenth International Conference on Machine Learning
An Adapted Lesk Algorithm for Word Sense Disambiguation Using WordNet

CICLing '02 Proceedings of the Third International Conference on Computational Linguistics and Intelligent Text Processing
An Approach for Measuring Semantic Similarity between Words Using Multiple Information Sources

IEEE Transactions on Knowledge and Data Engineering
Ontologies Improve Text Document Clustering

ICDM '03 Proceedings of the Third IEEE International Conference on Data Mining
Verbs semantics and lexical selection

ACL '94 Proceedings of the 32nd annual meeting on Association for Computational Linguistics
Efficient Phrase-Based Document Indexing for Web Document Clustering

IEEE Transactions on Knowledge and Data Engineering
Document Clustering with Semantic Analysis

HICSS '06 Proceedings of the 39th Annual Hawaii International Conference on System Sciences - Volume 03
Evaluating WordNet-based Measures of Lexical Semantic Relatedness

Computational Linguistics
Perspectives on ontology-based querying: Research Articles

International Journal of Intelligent Systems
A concept-based model for enhancing text categorization

Proceedings of the 13th ACM SIGKDD international conference on Knowledge discovery and data mining
WordNet-based text document clustering

ROMAND '04 Proceedings of the 3rd Workshop on RObust Methods in Analysis of Natural Language Data
Using information content to evaluate semantic similarity in a taxonomy

IJCAI'95 Proceedings of the 14th international joint conference on Artificial intelligence - Volume 1
Using measures of semantic relatedness for word sense disambiguation

CICLing'03 Proceedings of the 4th international conference on Computational linguistics and intelligent text processing

Combining social network and semantic concept analysis for personalized academic researcher recommendation

Decision Support Systems

Quantified Score

Hi-index	0.00

Visualization

Abstract

Most text clustering techniques are based on words and/or phrases weights in the text. Such representation is often unsatisfactory because it ignores the relationships between terms, and considers them as independent features. In this paper, a new semantic similarity based model (SSBM) is proposed. The semantic similarity based model computes semantic similarities by utilizing WordNet as an ontology. The proposed model captures the semantic similarities between documents that contain semantically similar terms but unnecessarily syntactically identical. The semantic similarity based model assigns a new weight to document terms reflecting the semantic relationships between terms that co-occur literally in the document. Our model in conjunction with the extended gloss overlaps measure and the adapted Lesk algorithm solves ambiguity, synonymy problems that are not detected using traditional term frequency based text mining techniques. The proposed model is evaluated on the Reuters-21578 and the 20-Newsgroups text collections datasets. The performance is assessed in terms of the Fmeasure, Purity and Entropy quality measures. The obtained results show promising performance improvements compared to the traditional term based vector space model (VSM) as well as other existing methods that include semantic similarity measures in text clustering.