A weighting k-modes algorithm for subspace clustering of categorical data

  • Authors:
  • Fuyuan Cao;Jiye Liang;Deyu Li;Xingwang Zhao

  • Affiliations:
  • School of Computer and Information Technology, Key Laboratory of Computational Intelligence and Chinese Information Processing of Ministry of Education, Shanxi University, Taiyuan, 030006 Shanxi, ...;School of Computer and Information Technology, Key Laboratory of Computational Intelligence and Chinese Information Processing of Ministry of Education, Shanxi University, Taiyuan, 030006 Shanxi, ...;School of Computer and Information Technology, Key Laboratory of Computational Intelligence and Chinese Information Processing of Ministry of Education, Shanxi University, Taiyuan, 030006 Shanxi, ...;School of Computer and Information Technology, Key Laboratory of Computational Intelligence and Chinese Information Processing of Ministry of Education, Shanxi University, Taiyuan, 030006 Shanxi, ...

  • Venue:
  • Neurocomputing
  • Year:
  • 2013

Quantified Score

Hi-index 0.01

Visualization

Abstract

Traditional clustering algorithms consider all of the dimensions of an input data set equally. However, in the high dimensional data, a common property is that data points are highly clustered in subspaces, which means classes of objects are categorized in subspaces rather than the entire space. Subspace clustering is an extension of traditional clustering that seeks to find clusters in different subspaces within a data set. In this paper, a weighting k-modes algorithm is presented for subspace clustering of categorical data and its corresponding time complexity is analyzed as well. In the proposed algorithm, an additional step is added to the k-modes clustering process to automatically compute the weight of all dimensions in each cluster by using complement entropy. Furthermore, the attribute weight can be used to identify the subsets of important dimensions that categorize different clusters. The effectiveness of the proposed algorithm is demonstrated with real data sets and synthetic data sets.