Projected clustering for categorical datasets

  • Authors:
  • Minho Kim; R. S. Ramakrishna

  • Affiliations:
  • Bioinformatics Team, Electronics and Telecommunications Research Institute (ETRI), 161 Gajeong-dong, Yuseong-gu, Daejeon 305-350, Republic of Korea; Department of Information and Communications, Gwangju Institute of Science and Technology, 1 Oryong-dong, Buk-gu, Gwangju 500-712, Republic of Korea

  • Venue:
  • Pattern Recognition Letters
  • Year:
  • 2006

Quantified Score

Hi-index 0.10

Abstract

This paper addresses the problem of clustering categorical datasets. Categorical data typically have few measurement levels and are sparse in a space of very high dimension, so conventional dissimilarity measures are inadequate. We propose a new clustering algorithm based on projected clustering. Although hierarchical in essence, the proposed algorithm avoids the error propagation characteristic of hierarchical methods through reassignment and the deletion of bad clusters. We also propose new indices for cluster validation in categorical datasets, an area that is almost unexplored. We present techniques for finding the optimal number of clusters and for initializing cluster centers. Experimental results demonstrate the effectiveness of the proposed clustering algorithm. The proposed cluster validation for categorical datasets is also shown to be quite efficient.
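
To illustrate why a full-dimensional dissimilarity can mislead on categorical data, the following minimal sketch compares the simple matching dissimilarity over all attributes with the same measure restricted to a subset of attributes. It is not the algorithm proposed in the paper: the function name `matching_dissimilarity`, the toy records, and the choice of relevant attributes are all assumptions made for illustration only.

```python
from typing import Iterable, Optional, Sequence


def matching_dissimilarity(
    a: Sequence[str],
    b: Sequence[str],
    dims: Optional[Iterable[int]] = None,
) -> float:
    """Fraction of the compared attributes on which a and b disagree.

    If dims is None, every attribute is compared; otherwise only the
    attribute indices listed in dims (a "projection") are compared.
    """
    if len(a) != len(b):
        raise ValueError("records must have the same number of attributes")
    indices = list(range(len(a))) if dims is None else list(dims)
    if not indices:
        raise ValueError("at least one attribute must be compared")
    mismatches = sum(1 for d in indices if a[d] != b[d])
    return mismatches / len(indices)


# Hypothetical toy records: attributes 0 and 1 are assumed to carry the
# cluster structure, attributes 2..7 are assumed to be irrelevant noise.
x = ["red", "round", "a", "b", "c", "d", "e", "f"]
y = ["red", "round", "g", "h", "i", "j", "k", "l"]    # same group as x
z = ["blue", "square", "a", "b", "c", "d", "e", "f"]  # different group

relevant = [0, 1]  # assumed known here; a projected method must find these

full_xy = matching_dissimilarity(x, y)            # 6 of 8 differ
full_xz = matching_dissimilarity(x, z)            # 2 of 8 differ
proj_xy = matching_dissimilarity(x, y, relevant)  # 0 of 2 differ
proj_xz = matching_dissimilarity(x, z, relevant)  # 2 of 2 differ

print(full_xy, full_xz, proj_xy, proj_xz)
```

In this toy setting the full-dimensional measure ranks `x` closer to `z` than to `y`, while the measure restricted to the assumed relevant attributes reverses that order. Finding such attribute subsets per cluster is the general idea behind projected clustering; how the paper selects them is not stated in the abstract.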