Fast document clustering based on weighted comparative advantage

  • Authors:
  • Jie Ji; Tony Y. T. Chan; Qiangfu Zhao

  • Affiliations:
  • Intelligent System Lab, The University of Aizu, Aizuwakamatsu, Fukushima, Japan; School of Computing, The University of Akureyri, Iceland; Intelligent System Lab, The University of Aizu, Aizuwakamatsu, Fukushima, Japan

  • Venue:
  • SMC'09: Proceedings of the 2009 IEEE International Conference on Systems, Man and Cybernetics
  • Year:
  • 2009

Abstract

Document clustering is the process of partitioning a set of unlabeled documents into clusters such that the documents within each cluster share some common concepts. To help with this analysis, concepts are conveniently represented by key terms. For a clustering algorithm, most of the CPU time is spent in the classification phase. Using words as features, text data are represented in a very high-dimensional vector space. We have previously studied a comparative advantage based algorithm for clustering sparse data in this space; it uses one "ruler" instead of k centers to identify the comparative advantage of each cluster and to assign a cluster label to each document. However, that algorithm considers only the relative strength between clusters and ignores the relationship between terms. In this paper, we propose a weighted comparative advantage based clustering algorithm. Experimental results on SMART system databases show that the new algorithm performs better than the simple comparative advantage algorithm without any extra computation time. Compared with k-means, it not only achieves comparable results but also significantly accelerates the clustering procedure.
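
As a rough illustration of the representation the abstract refers to (words as features in a very high-dimensional, sparse vector space), a minimal Python sketch is given here. The naive whitespace tokenizer, the helper names `tokenize`, `build_vocabulary` and `to_sparse_vector`, and the toy documents are all assumptions made for illustration. The sketch does not reproduce the comparative advantage algorithm or its weighted variant, since the abstract does not specify them.

```python
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately naive, assumed tokenizer)."""
    return text.lower().split()


def build_vocabulary(docs):
    """Map each distinct term to a column index, in sorted term order."""
    terms = sorted({term for doc in docs for term in tokenize(doc)})
    return {term: index for index, term in enumerate(terms)}


def to_sparse_vector(doc, vocab):
    """Return {column index: term count}, storing only the non-zero entries."""
    counts = Counter(tokenize(doc))
    return {vocab[term]: count for term, count in counts.items() if term in vocab}


# Toy documents (hypothetical, not taken from the SMART collections).
docs = [
    "clustering groups similar documents",
    "documents share key terms",
    "sparse vectors store few terms",
]

vocab = build_vocabulary(docs)
vectors = [to_sparse_vector(doc, vocab) for doc in docs]

# Each document touches only a handful of the vocabulary's dimensions,
# which is why a dictionary of non-zero entries is a natural storage choice.
for doc, vector in zip(docs, vectors):
    print(f"{len(vector)} of {len(vocab)} dimensions used: {doc!r}")
```

With a realistic corpus the vocabulary grows to tens of thousands of terms while each document still uses only a small fraction of them, which is the sparsity that the clustering algorithms discussed in the abstract have to cope with.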