Subspace clustering of text documents with feature weighting k-means algorithm

  • Authors:
  • Liping Jing;Michael K. Ng;Jun Xu;Joshua Zhexue Huang

  • Affiliations:
  • Department of Mathematics, The University of Hong Kong, Hong Kong, China;Department of Mathematics, The University of Hong Kong, Hong Kong, China;E-Business Technology Institute, The University of Hong Kong, Hong Kong, China;E-Business Technology Institute, The University of Hong Kong, Hong Kong, China

  • Venue:
  • PAKDD'05 Proceedings of the 9th Pacific-Asia conference on Advances in Knowledge Discovery and Data Mining
  • Year:
  • 2005

Abstract

This paper presents a new method for clustering large and complex text data. The method is based on a new subspace clustering algorithm that automatically calculates feature weights during the k-means clustering process. When clustering sparse text data, the feature weights are used to discover clusters in subspaces of the document vector space and to identify keywords that represent the semantics of the clusters. We present a modification of the published algorithm to solve the sparsity problem that occurs in text clustering. Experimental results on real-world text data show that the new method outperforms the Standard KMeans and Bisection-KMeans algorithms while maintaining the efficiency of the k-means clustering process.
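
The abstract does not give the update formulas, so the following is only a rough sketch of what a feature-weighting k-means loop might look like, not the authors' implementation. It assumes each cluster keeps its own weight vector, that a feature's weight is inversely related to its within-cluster dispersion, and that a small constant `sigma` is added to every per-feature distance so that features with zero dispersion (common in sparse text data) do not absorb all the weight. The function name `weighted_kmeans` and the parameters `beta` and `sigma` are hypothetical names chosen for illustration.

```python
import numpy as np

def weighted_kmeans(X, k, beta=2.0, sigma=1e-3, n_iter=20, seed=0):
    """Sketch of k-means with per-cluster feature weights (assumed update rule)."""
    rng = np.random.default_rng(seed)
    n, m = X.shape
    # Start from k distinct documents as centers and uniform feature weights.
    centers = X[rng.choice(n, size=k, replace=False)].astype(float)
    weights = np.full((k, m), 1.0 / m)

    def assign(centers, weights):
        # Per-feature squared differences, shape (n, k, m).
        diff2 = (X[:, None, :] - centers[None, :, :]) ** 2
        # Weighted distance; sigma is the assumed sparsity correction.
        dist = np.einsum("nkm,km->nk", diff2 + sigma, weights ** beta)
        return dist.argmin(axis=1)

    for _ in range(n_iter):
        labels = assign(centers, weights)
        for l in range(k):
            members = X[labels == l]
            if len(members) == 0:
                continue  # leave an empty cluster unchanged
            centers[l] = members.mean(axis=0)
            # Dispersion of each feature inside cluster l (always > 0 due to sigma).
            disp = ((members - centers[l]) ** 2 + sigma).sum(axis=0)
            # Low dispersion -> high weight; weights of a cluster sum to 1.
            inv = disp ** (-1.0 / (beta - 1.0))
            weights[l] = inv / inv.sum()

    # Final assignment so labels agree with the returned centers and weights.
    return assign(centers, weights), centers, weights

# Toy data: two groups of "documents" over six made-up term features.
toy_rng = np.random.default_rng(1)
group_a = toy_rng.random((20, 6)) * 0.1
group_a[:, 0] += 1.0  # term 0 is characteristic of group A
group_b = toy_rng.random((20, 6)) * 0.1
group_b[:, 3] += 1.0  # term 3 is characteristic of group B
X = np.vstack([group_a, group_b])

labels, centers, weights = weighted_kmeans(X, k=2)
# Features with the largest weight in each cluster are candidate keywords.
top_features = weights.argmax(axis=1)
```

Under these assumptions, each row of `weights` describes the subspace in which one cluster is compact, and the highest-weighted features play the role of the cluster's keywords.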