An optimized k-means algorithm of reducing cluster intra-dissimilarity for document clustering

  • Authors:
  • Daling Wang;Ge Yu;Yubin Bao;Meng Zhang

  • Affiliations:
  • School of Information Science and Engineering, Northeastern University, Shenyang, P.R.China;School of Information Science and Engineering, Northeastern University, Shenyang, P.R.China;School of Information Science and Engineering, Northeastern University, Shenyang, P.R.China;School of Information Science and Engineering, Northeastern University, Shenyang, P.R.China

  • Venue:
  • WAIM'05 Proceedings of the 6th international conference on Advances in Web-Age Information Management
  • Year:
  • 2005

Abstract

Because documents are high-dimensional and sparse, clustering similar documents together is difficult. K-Means, the most popular document clustering method, suffers from high intra-cluster dissimilarity, i.e., it tends to place unrelated documents in the same cluster. One reason is that every object (document) in a cluster has the same influence on the cluster mean. SOM (Self-Organizing Map) is a method that reduces the dimensionality of data and displays the data in a low-dimensional space, and it has been applied successfully to clustering high-dimensional objects. The scalar factor is an important part of SOM. This paper proposes an optimized K-Means algorithm that introduces the scalar factor from SOM into the means during the K-Means assignment stage, controlling the influence of new objects on the means. Experiments show that the optimized K-Means algorithm achieves a higher F-Measure and lower entropy than the standard K-Means algorithm, and thereby reduces the intra-dissimilarity of clusters effectively.
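
The abstract does not give the update rule, so the following is only a minimal sketch of the general idea, not the authors' algorithm. It assumes a SOM-style update in which each newly assigned object x moves its nearest mean m by m + alpha * (x - m), with the scalar factor alpha decaying linearly over epochs. The function name som_style_kmeans, the parameters alpha0 and n_epochs, the decay schedule, and the use of Euclidean distance are all assumptions made for illustration.

```python
import numpy as np


def som_style_kmeans(X, k, n_epochs=10, alpha0=0.5, seed=0):
    """Online K-Means with a SOM-like scalar factor (illustrative assumption).

    Rather than recomputing each mean as the plain average of its members,
    every assigned object pulls its nearest mean toward itself by a factor
    alpha, so later objects have a controlled, shrinking influence.
    """
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)

    # Initialise the means with k distinct objects chosen at random.
    means = X[rng.choice(len(X), size=k, replace=False)].copy()

    for epoch in range(n_epochs):
        # Hypothetical linear decay of the scalar factor; the paper's actual
        # schedule is not stated in the abstract.
        alpha = alpha0 * (1.0 - epoch / n_epochs)

        for i in rng.permutation(len(X)):
            x = X[i]
            # Assignment: index of the nearest mean (Euclidean distance).
            j = int(np.argmin(np.linalg.norm(means - x, axis=1)))
            # SOM-style update: move the winning mean toward x by alpha.
            means[j] += alpha * (x - means[j])

    # Final assignment of every object against the fixed means.
    dists = np.linalg.norm(X[:, None, :] - means[None, :, :], axis=2)
    return means, dists.argmin(axis=1)
```

Calling som_style_kmeans(X, k=2) on a small array of document vectors returns the k means and one cluster label per row. For real document data, cosine similarity on TF-IDF vectors would be a more typical choice than Euclidean distance; the sketch uses Euclidean distance only to stay short.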