Knowledge-based vector space model for text clustering

Authors:
Liping Jing;Michael K. Ng;Joshua Z. Huang
Affiliations:
Beijing Jiaotong University, School of Computer and Information Technology, Beijing, China;Hong Kong Baptist University, Department of Mathematics, Kowloon Tong, Hong Kong, China;The University of Hong Kong, E-Business Technology Institute, Pokfulam Road, Hong Kong, China
Venue:
Knowledge and Information Systems
Year:
2010

Abstract

This paper presents a new knowledge-based vector space model (VSM) for text clustering. The new model incorporates semantic relationships between terms (e.g., words or concepts) into the vector representation of text documents. The aim is to compute the dissimilarity between two documents more accurately so that text clustering results improve. The semantic relationship between two terms is defined as the similarity of the two terms, and this similarity is used to re-weight term frequencies in the VSM. We study two similarity measures for computing the semantic relationship between two terms, each based on a different approach. The first approach uses existing ontologies such as WordNet and MeSH: we define a new similarity measure that combines the edge-counting technique, the average distance and the position weighting method to compute the similarity of two terms from an ontology hierarchy. The second approach uses text corpora to construct the relationships between terms and then calculates their semantic similarities. Three clustering algorithms (bisecting k-means, feature weighting k-means and a hierarchical clustering algorithm) were used to cluster real-world text data represented in the new knowledge-based VSM. The experimental results show that clustering performance with the new model was much better than with the traditional term-based VSM.
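The re-weighting idea can be illustrated with a minimal sketch. The abstract does not give the paper's formula, so the code below assumes one plausible reading: each term's new weight is the similarity-weighted sum of all term frequencies in the document. The vocabulary, the similarity matrix `SIM`, and the function names `reweight` and `cosine_dissimilarity` are hypothetical and chosen only for illustration.

```python
from math import sqrt

def reweight(tf, sim):
    """Re-weight term frequencies with a term-term similarity matrix.

    Assumed rule (not taken from the paper): new[i] = sum_j sim[i][j] * tf[j].
    """
    n = len(tf)
    return [sum(sim[i][j] * tf[j] for j in range(n)) for i in range(n)]

def cosine_dissimilarity(u, v):
    """Return 1 - cosine similarity; 1.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sqrt(sum(a * a for a in u))
    norm_v = sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 1.0
    return 1.0 - dot / (norm_u * norm_v)

# Hypothetical vocabulary: ["car", "automobile", "banana"].
# The similarity values are made up for the example.
SIM = [
    [1.0, 0.9, 0.0],
    [0.9, 1.0, 0.0],
    [0.0, 0.0, 1.0],
]

doc_a = [2, 0, 0]  # uses only "car"
doc_b = [0, 3, 0]  # uses only "automobile"

# Term-based VSM: the documents share no terms, so they look unrelated.
plain = cosine_dissimilarity(doc_a, doc_b)

# Knowledge-based VSM (under the assumed rule): related terms share weight.
semantic = cosine_dissimilarity(reweight(doc_a, SIM), reweight(doc_b, SIM))

print(plain, semantic)
```

In this toy case the two documents have no terms in common, so the term-based dissimilarity is 1.0, while the re-weighted vectors overlap and the dissimilarity drops close to zero. In practice the entries of `SIM` would come from an ontology-based or corpus-based similarity measure such as those the abstract describes.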