This paper addresses the processing of cross-domain document repositories, a task complicated by word ambiguity and by the fact that monosemic words are more domain-oriented than polysemic ones. It describes a semantically enhanced text normalization algorithm (SETS) aimed at improving document clustering, and investigates the performance of the sk-means clustering algorithm across domains by comparing the cluster coherence obtained with semantic-based and traditional (TF-IDF-based) document representations. The evaluation is conducted on 20 generic sub-domains of a thousand documents each, randomly selected from the Reuters-21578 corpus. The experimental results show that clusters produced with SETS are more coherent than those produced with text normalization by the Porter stemmer. In addition, semantic-based text normalization is shown to be resistant to noise, which is often introduced at the index aggregation stage.
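The comparison described above can be illustrated with a minimal sketch, assuming that sk-means denotes spherical k-means (k-means over unit-length vectors with cosine similarity). It does not reproduce SETS. The input tokens stand in for whatever normalization is applied upstream (Porter stems or semantic concepts). The function names, the smoothed IDF weighting, the farthest-first seeding, and the coherence measure (mean cosine similarity of each document to its centroid) are all assumptions made for illustration, not details taken from the paper.

```python
import math
from collections import Counter


def normalize(vec):
    """Scale a sparse vector (term -> weight) to unit L2 length."""
    norm = math.sqrt(sum(w * w for w in vec.values()))
    return {t: w / norm for t, w in vec.items()} if norm > 0 else dict(vec)


def cosine(a, b):
    """Dot product of two sparse unit vectors, i.e. their cosine similarity."""
    if len(a) > len(b):
        a, b = b, a
    return sum(w * b.get(t, 0.0) for t, w in a.items())


def tfidf_unit_vectors(docs):
    """Map tokenized documents to unit-length TF-IDF vectors."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        # Smoothed IDF: an assumption, the paper's exact weighting is not given.
        vec = {t: c * (math.log((1 + n) / (1 + df[t])) + 1.0) for t, c in tf.items()}
        vectors.append(normalize(vec))
    return vectors


def spherical_kmeans(vectors, k, iterations=20):
    """Cluster unit vectors by cosine similarity; return (labels, centroids)."""
    # Deterministic farthest-first seeding (an assumption, chosen for reproducibility).
    centroids = [dict(vectors[0])]
    while len(centroids) < k:
        farthest = min(vectors, key=lambda v: max(cosine(v, c) for c in centroids))
        centroids.append(dict(farthest))
    labels = None
    for _ in range(iterations):
        # Assignment step: each document joins its most similar centroid.
        new_labels = [max(range(k), key=lambda j: cosine(v, centroids[j])) for v in vectors]
        if new_labels == labels:
            break
        labels = new_labels
        # Update step: each centroid becomes the normalized sum of its members.
        for j in range(k):
            total = {}
            for v, label in zip(vectors, labels):
                if label == j:
                    for t, w in v.items():
                        total[t] = total.get(t, 0.0) + w
            if total:  # keep the previous centroid if the cluster is empty
                centroids[j] = normalize(total)
    return labels, centroids


def mean_coherence(vectors, labels, centroids):
    """Average cosine similarity between each document and its centroid."""
    return sum(cosine(v, centroids[l]) for v, l in zip(vectors, labels)) / len(vectors)
```

Under these assumptions, two representations of the same corpus could be compared by running `tfidf_unit_vectors` and `spherical_kmeans` once per normalization scheme and comparing the resulting `mean_coherence` values.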