Concept Forest: A New Ontology-assisted Text Document Similarity Measurement Method

Authors:
James Z. Wang; William Taylor
Venue:
WI '07 Proceedings of the IEEE/WIC/ACM International Conference on Web Intelligence
Year:
2007

Citing 0
Cited 7

Towards design principles for effective context- and perspective-based web mining
Proceedings of the 4th International Conference on Design Science Research in Information Systems and Technology

GravPad
Proceedings of the 6th International Symposium on Wikis and Open Collaboration

ImpactWheel: Visual Analysis of the Impact of Online News
WI-IAT '11 Proceedings of the 2011 IEEE/WIC/ACM International Conferences on Web Intelligence and Intelligent Agent Technology - Volume 01

Dynamically generating context-relevant sub-webs
DESRIST'10 Proceedings of the 5th international conference on Global Perspectives on Design Science Research

Concept chaining utilizing meronyms in text characterization
Proceedings of the 12th ACM/IEEE-CS joint conference on Digital Libraries

SemaFor: semantic document indexing using semantic forests
Proceedings of the 21st ACM international conference on Information and knowledge management

The impact of conceptualization on text classification
WISE'12 Proceedings of the 13th international conference on Web Information Systems Engineering

Abstract

Although using ontologies to assist information retrieval and text document processing has recently attracted more and more attention, existing ontology-based approaches have not shown advantages over the traditional keyword-based Latent Semantic Indexing (LSI) method. This paper proposes an algorithm to extract a concept forest (CF) from a document with the assistance of a natural language ontology, the WordNet lexical database. Using concept forests to represent the semantics of text documents, the semantic similarity of two documents is then measured as the commonality of their concept forests. Performance studies of text document clustering based on different document similarity measurement methods show that the CF-based similarity measurement is an effective alternative to the existing keyword-based methods. In particular, the CF-based approach has clear advantages over the existing keyword-based methods, including LSI, in processing short text documents, and in P2P or live news environments where it is impractical to collect the entire document corpus for analysis.
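The abstract does not spell out how a concept forest is built or how commonality is scored, so the following is only a minimal illustrative sketch, not the authors' algorithm. It assumes a tiny hand-written hypernym map (`HYPERNYMS`, hypothetical data standing in for WordNet), represents a forest as the set of child-to-parent edges reached from a document's words, and uses Jaccard overlap of those edge sets as one possible commonality measure. The function names `concept_forest` and `forest_similarity` are likewise hypothetical.

```python
from typing import Dict, Set, Tuple

# Toy hypernym map standing in for WordNet (hypothetical data, child -> parent).
HYPERNYMS: Dict[str, str] = {
    "dog": "canine",
    "wolf": "canine",
    "canine": "mammal",
    "cat": "feline",
    "feline": "mammal",
    "mammal": "animal",
    "car": "vehicle",
    "truck": "vehicle",
}

Edge = Tuple[str, str]


def concept_forest(text: str) -> Set[Edge]:
    """Collect (child, parent) edges on the hypernym paths of known words."""
    edges: Set[Edge] = set()
    for word in text.lower().split():
        node = word.strip(".,;:!?")
        # Walk upward until the node has no recorded hypernym.
        while node in HYPERNYMS:
            parent = HYPERNYMS[node]
            edges.add((node, parent))
            node = parent
    return edges


def forest_similarity(a: Set[Edge], b: Set[Edge]) -> float:
    """Jaccard overlap of two edge sets (an assumed measure); 0.0 if both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0
```

In this sketch, "The dog chased the cat." and "A wolf howled." share no keywords, yet their forests overlap on the edges leading from canine up to animal, so they receive a nonzero score. Each forest is also computed from its own document alone, which loosely mirrors the abstract's point about settings where the full corpus is unavailable.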