DHCC: Divisive hierarchical clustering of categorical data

  • Authors:
  • Tengke Xiong;Shengrui Wang;André Mayers;Ernest Monga

  • Affiliations:
  • Department of Computer Science, University of Sherbrooke, Sherbrooke, Canada J1K 2R1;Department of Computer Science, University of Sherbrooke, Sherbrooke, Canada J1K 2R1;Department of Computer Science, University of Sherbrooke, Sherbrooke, Canada J1K 2R1;Department of Mathematics, University of Sherbrooke, Sherbrooke, Canada J1K 2R1

  • Venue:
  • Data Mining and Knowledge Discovery
  • Year:
  • 2012

Abstract

Clustering categorical data poses two challenges: defining an inherently meaningful similarity measure, and effectively dealing with clusters that are often embedded in different subspaces. In this paper, we propose a novel divisive hierarchical clustering algorithm for categorical data, named DHCC. We view the task of clustering categorical data from an optimization perspective, and propose effective procedures to initialize and refine the splitting of clusters. The initialization of the splitting is based on multiple correspondence analysis (MCA). We also devise a strategy for deciding when to terminate the splitting process. The proposed algorithm has five merits. First, due to its hierarchical nature, our algorithm yields a dendrogram representing nested groupings of patterns and similarity levels at different granularities. Second, it is parameter-free, fully automatic and, in particular, requires no assumption regarding the number of clusters. Third, it is independent of the order in which the data are processed. Fourth, it is scalable to large data sets. Finally, our algorithm is capable of seamlessly discovering clusters embedded in subspaces, thanks to its use of a novel data representation and Chi-square dissimilarity measures. Experiments on both synthetic and real data demonstrate the superior performance of our algorithm.
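
To make the MCA-based initialization more concrete, the following is a minimal sketch of how a single cluster might be bisected using the first axis of a multiple correspondence analysis. It is an illustration under stated assumptions, not the authors' implementation: the function names (`indicator_matrix`, `mca_first_axis`, `initial_split`) are hypothetical, splitting by the sign of the first-axis coordinate is an assumed rule that the abstract does not spell out, and the refinement and termination steps mentioned in the abstract are omitted.

```python
import numpy as np

def indicator_matrix(rows):
    """One-hot encode categorical rows into a 0/1 indicator matrix.

    Returns the matrix and the (attribute index, category) pair behind each column.
    """
    n_attrs = len(rows[0])
    columns = []
    for j in range(n_attrs):
        for value in sorted({row[j] for row in rows}):
            columns.append((j, value))
    Z = np.array([[1.0 if row[j] == v else 0.0 for (j, v) in columns]
                  for row in rows])
    return Z, columns

def mca_first_axis(Z):
    """Row coordinates on the first MCA axis.

    Standard correspondence analysis of the indicator matrix: take the SVD of
    the standardized residuals and scale the first left singular vector.
    """
    P = Z / Z.sum()                      # correspondence matrix
    r = P.sum(axis=1)                    # row masses
    c = P.sum(axis=0)                    # column masses
    expected = np.outer(r, c)
    S = (P - expected) / np.sqrt(expected)
    U, s, _ = np.linalg.svd(S, full_matrices=False)
    return U[:, 0] * s[0] / np.sqrt(r)

def initial_split(rows):
    """Assumed splitting rule: group objects by the sign of the first-axis coordinate."""
    Z, _ = indicator_matrix(rows)
    return mca_first_axis(Z) >= 0

# Toy data (made up for illustration): two groups that share no categories.
rows = [
    ("a", "x", "p"), ("a", "x", "q"), ("a", "x", "p"),
    ("b", "y", "r"), ("b", "y", "s"), ("b", "y", "r"),
]
side = initial_split(rows)
```

On this toy input, the first three objects land on one side of the split and the last three on the other. Which side is labeled `True` depends on the arbitrary sign of the singular vector, so only the grouping is meaningful. A full divisive procedure would apply such a split recursively, refine each bisection, and stop according to a termination criterion.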