On ontology-driven document clustering using core semantic features

Authors:
Samah Fodeh;Bill Punch;Pang-Ning Tan
Affiliations:
Yale University, New Haven, CT, USA;Michigan State University, East Lansing, MI, USA;Michigan State University, East Lansing, MI, USA
Venue:
Knowledge and Information Systems - Special Issue on "Context-Aware Data Mining (CADM)"
Year:
2011


Abstract

Incorporating semantic knowledge from an ontology into document clustering is an important but challenging problem. While numerous methods have been developed, the value of using such an ontology is still not clear. We show in this paper that an ontology can be used to greatly reduce the number of features needed for document clustering. Our hypothesis is that polysemous and synonymous nouns are both relatively prevalent and fundamentally important for document cluster formation. We show that nouns can be efficiently identified in documents and that restricting the features to nouns alone improves clustering. We then demonstrate the importance of polysemous and synonymous nouns in clustering and develop an approach that measures the information gained by disambiguating these nouns in an unsupervised learning setting. In doing so, we identify a core subset of semantic features that represents a text corpus. Empirical results show that, by using core semantic features for clustering, one can reduce the number of features by 90% or more and still produce clusters that capture the main themes in a text corpus.
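The feature-reduction idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. The sense inventory `TOY_SENSES`, the helper names, and the two example documents are all hypothetical, and a small hand-written dictionary stands in for a real ontology. The sketch keeps only nouns that are polysemous (more than one sense) or synonymous (sharing a sense with another noun) and builds term-frequency vectors over that reduced vocabulary. It does not perform the disambiguation or the information-gain measurement described in the abstract.

```python
from collections import Counter

# Hypothetical toy ontology: noun -> set of sense identifiers.
# A real system would query an actual ontology instead of this dictionary.
TOY_SENSES = {
    "bank": {"bank.finance", "bank.river"},
    "loan": {"loan.finance"},
    "credit": {"loan.finance", "credit.praise"},
    "river": {"river.water"},
    "stream": {"river.water", "stream.data"},
    "water": {"water.liquid"},
}


def is_polysemous(noun):
    """A noun is polysemous here if the toy ontology lists several senses."""
    return len(TOY_SENSES[noun]) > 1


def has_synonym(noun):
    """A noun is synonymous here if it shares a sense with another noun."""
    return any(
        other != noun and TOY_SENSES[noun] & senses
        for other, senses in TOY_SENSES.items()
    )


def core_features(docs):
    """Return the full vocabulary and the reduced noun-based feature list."""
    vocab = {tok for doc in docs for tok in doc.lower().split()}
    nouns = vocab & set(TOY_SENSES)  # nouns are the words the ontology knows
    core = sorted(n for n in nouns if is_polysemous(n) or has_synonym(n))
    return sorted(vocab), core


def vectorize(doc, features):
    """Term-frequency vector of a document over the given features."""
    counts = Counter(doc.lower().split())
    return [counts[f] for f in features]


docs = [
    "the bank approved the loan and extended more credit",
    "the river bank flooded when the stream carried more water",
]
vocab, core = core_features(docs)
reduction = 1 - len(core) / len(vocab)  # fraction of features removed
vectors = [vectorize(d, core) for d in docs]

print(core)
print(f"features: {len(vocab)} -> {len(core)} ({reduction:.0%} fewer)")
print(vectors)
```

In this toy corpus the word "bank" appears in both documents with different senses, which is the kind of ambiguity that the disambiguation step in the abstract is meant to resolve. The sketch leaves it as a single feature.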