Comparative Advantage Approach for Sparse Text Data Clustering

  • Authors:
  • Jie Ji;Tony Y. T. Chan;Qiangfu Zhao

  • Venue:
  • CIT '09 Proceedings of the 2009 Ninth IEEE International Conference on Computer and Information Technology - Volume 02
  • Year:
  • 2009

Abstract

Document clustering is the process of partitioning a set of n unlabeled documents into clusters such that the documents in each cluster share some common concepts. Each concept is conveniently represented by some key terms. Using words as features, each document is represented as a vector in a very high dimensional vector space. However, most documents are sparse vectors, for example, with more than ten thousand dimensions and a sparsity of 98%. In this paper, we study a fast classification algorithm based on the idea of comparative advantage for clustering sparse data. The proposed algorithm uses one “ruler” instead of k centers to identify the comparative advantage of each cluster and define the cluster label for each document. Experimental results show that our algorithm has performance comparable to k-means but is faster. It can also produce clusters whose concepts overlap less, in the sense of key terms.
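
To illustrate the sparse representation the abstract refers to, the following is a minimal sketch of a bag-of-words encoding and a sparsity measurement. It is not the authors' algorithm or code: the function names `bag_of_words` and `sparsity`, the whitespace tokenization, and the toy documents are all assumptions made for illustration.

```python
from collections import Counter


def bag_of_words(docs):
    """Encode documents as sparse rows: each row maps term index -> count.

    Tokenization by lowercasing and splitting on whitespace is an
    assumption for this sketch, not something stated in the abstract.
    """
    vocab = {}  # term -> column index
    rows = []
    for doc in docs:
        counts = Counter(doc.lower().split())
        row = {}
        for term, count in counts.items():
            index = vocab.setdefault(term, len(vocab))
            row[index] = count
        rows.append(row)
    return vocab, rows


def sparsity(rows, n_terms):
    """Fraction of zero entries in the implied document-term matrix."""
    total = len(rows) * n_terms
    nonzero = sum(len(row) for row in rows)
    return 1.0 - nonzero / total


# Hypothetical toy corpus; real text collections are far larger and sparser.
docs = [
    "clustering sparse text data",
    "k-means clustering is slow",
    "sparse vectors save memory",
]
vocab, rows = bag_of_words(docs)
result = sparsity(rows, len(vocab))
print(len(vocab), round(result, 2))
```

Storing only the nonzero counts per document keeps memory proportional to the number of distinct terms that actually occur, which is what makes vocabularies of ten thousand or more dimensions practical to work with.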