Fast document clustering based on weighted comparative advantage
SMC'09 Proceedings of the 2009 IEEE international conference on Systems, Man and Cybernetics
Document clustering is the process of partitioning a set of n unlabeled documents into clusters such that the documents in each cluster share some common concepts. Each concept is conveniently represented by a few key terms. Using words as features, each document is represented as a vector in a very high-dimensional vector space. However, most document vectors are sparse: for example, a vector may have more than ten thousand dimensions with a sparsity of 98%. In this paper, we study a fast classification algorithm for clustering sparse data, based on the idea of comparative advantage. The proposed algorithm uses one “ruler” instead of k centers to identify the comparative advantage of each cluster and to define the cluster label for each document. Experimental results show that our algorithm achieves performance comparable to k-means while running faster. It can also produce clusters whose concepts overlap less, in the sense of key terms.
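To make the sparse word-feature representation mentioned in the abstract concrete, here is a minimal Python sketch that builds term-count vectors and measures their sparsity. It is only an illustration of the representation, not the paper's clustering algorithm; the function names (`build_vocabulary`, `to_sparse_vector`, `sparsity`), the whitespace tokenization, and the toy documents are all assumptions made for this example.

```python
from collections import Counter

def build_vocabulary(docs):
    """Map each distinct word to a column index (hypothetical helper)."""
    words = sorted({w for doc in docs for w in doc.lower().split()})
    return {w: i for i, w in enumerate(words)}

def to_sparse_vector(doc, vocab):
    """Return {column index: term count}, storing only nonzero entries."""
    counts = Counter(doc.lower().split())
    return {vocab[w]: c for w, c in counts.items() if w in vocab}

def sparsity(vectors, n_dims):
    """Fraction of zero entries across all document vectors."""
    nonzero = sum(len(v) for v in vectors)
    return 1.0 - nonzero / (len(vectors) * n_dims)

# Toy corpus (assumed data, chosen only for illustration).
docs = [
    "clustering groups similar documents",
    "sparse vectors store few nonzero terms",
    "k-means uses k cluster centers",
]
vocab = build_vocabulary(docs)
vectors = [to_sparse_vector(d, vocab) for d in docs]
print(len(vocab), round(sparsity(vectors, len(vocab)), 3))
```

Even in this tiny corpus, each document touches only a fraction of the vocabulary, so storing just the nonzero entries is the natural choice. With a realistic vocabulary of more than ten thousand words, the fraction of zeros approaches the 98% figure quoted above.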