Enhanced cross-domain document clustering with a semantically enhanced text stemmer SETS

  • Authors:
  • Ivan Stankov;Diman Todorov;Rossitza Setchi

  • Affiliations:
  • Knowledge Engineering Systems Group, School of Engineering, Cardiff University, Cardiff, UK;Knowledge Engineering Systems Group, School of Engineering, Cardiff University, Cardiff, UK;Knowledge Engineering Systems Group, School of Engineering, Cardiff University, Cardiff, UK

  • Venue:
  • International Journal of Knowledge-based and Intelligent Engineering Systems - Selected papers of KES2012-Part 2 of 2
  • Year:
  • 2013

Abstract

The aim of document clustering is to produce coherent clusters of similar documents. Clustering algorithms rely on text normalisation techniques to represent and cluster documents. Although most document clustering algorithms perform well in specific knowledge domains, processing cross-domain document repositories remains a challenge. This paper attempts to address this challenge by investigating the performance of the sk-means clustering algorithm across domains, comparing the cluster coherence produced with semantic-based and traditional TF-IDF-based document representations. The evaluation is conducted on 20 different generic sub-domains, each consisting of a thousand documents randomly selected from the Reuters-21578 corpus. The experimental results demonstrate that clusters produced with the semantically enhanced text stemmer SETS are more coherent than those produced with text normalisation based on the Porter stemmer. In addition, semantic-based text normalisation is shown to be resistant to noise, which is often introduced in the index aggregation stage, the stage that acquires the features used to represent documents.
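The TF-IDF baseline mentioned in the abstract can be illustrated with a minimal sketch. It assumes that "sk-means" refers to spherical k-means (k-means on unit-length vectors with cosine similarity), and it does not implement SETS or the Porter stemmer. The function names, the toy documents, and the fixed seed indices are hypothetical choices made for illustration, not details taken from the paper.

```python
import math
from collections import Counter


def normalise(vec):
    """Scale a sparse vector (dict) to unit length; leave zero vectors as-is."""
    norm = math.sqrt(sum(w * w for w in vec.values()))
    return {t: w / norm for t, w in vec.items()} if norm > 0 else dict(vec)


def tfidf_vectors(docs):
    """Build unit-length TF-IDF vectors from tokenised documents."""
    n = len(docs)
    # Document frequency: number of documents containing each term.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vec = {t: count * math.log(n / df[t]) for t, count in tf.items()}
        vectors.append(normalise(vec))
    return vectors


def cosine(a, b):
    """Dot product of two sparse unit vectors, i.e. their cosine similarity."""
    return sum(w * b.get(t, 0.0) for t, w in a.items())


def spherical_kmeans(vectors, seeds, iterations=20):
    """Cluster unit vectors by cosine similarity.

    `seeds` lists the indices of the documents used as initial centroids
    (an assumption of this sketch; real runs usually seed randomly).
    """
    centroids = [dict(vectors[i]) for i in seeds]
    labels = [-1] * len(vectors)
    for _ in range(iterations):
        # Assignment step: pick the most similar centroid for each document.
        new_labels = [
            max(range(len(centroids)), key=lambda k: cosine(v, centroids[k]))
            for v in vectors
        ]
        if new_labels == labels:
            break  # converged
        labels = new_labels
        # Update step: sum the members of each cluster and renormalise.
        for k in range(len(centroids)):
            total = {}
            for v, label in zip(vectors, labels):
                if label == k:
                    for t, w in v.items():
                        total[t] = total.get(t, 0.0) + w
            if total:  # keep the old centroid if the cluster is empty
                centroids[k] = normalise(total)
    return labels


# Toy documents, already tokenised; stemming would happen before this step.
docs = [
    "oil prices rise crude".split(),
    "crude oil exports fall".split(),
    "wheat grain harvest".split(),
    "grain wheat exports rise".split(),
]
labels = spherical_kmeans(tfidf_vectors(docs), seeds=[0, 2])
print(labels)
```

In a setup like the one the abstract describes, the token lists passed to `tfidf_vectors` would first be normalised by a stemmer, so changing the stemmer changes the features the clustering step sees.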