DHCC: Divisive hierarchical clustering of categorical data

Authors:
Tengke Xiong; Shengrui Wang; André Mayers; Ernest Monga
Affiliations:
Department of Computer Science, University of Sherbrooke, Sherbrooke, Canada J1K 2R1; Department of Computer Science, University of Sherbrooke, Sherbrooke, Canada J1K 2R1; Department of Computer Science, University of Sherbrooke, Sherbrooke, Canada J1K 2R1; Department of Mathematics, University of Sherbrooke, Sherbrooke, Canada J1K 2R1
Venue:
Data Mining and Knowledge Discovery
Year:
2012


Abstract

Clustering categorical data poses two challenges: defining an inherently meaningful similarity measure, and effectively handling clusters that are often embedded in different subspaces. In this paper, we propose a novel divisive hierarchical clustering algorithm for categorical data, named DHCC. We view the task of clustering categorical data from an optimization perspective, and propose effective procedures to initialize and refine the splitting of clusters. The initialization of the splitting is based on multiple correspondence analysis (MCA). We also devise a strategy for deciding when to terminate the splitting process. The proposed algorithm has five merits. First, due to its hierarchical nature, it yields a dendrogram representing nested groupings of patterns and similarity levels at different granularities. Second, it is parameter-free and fully automatic; in particular, it requires no assumption regarding the number of clusters. Third, it is independent of the order in which the data are processed. Fourth, it scales to large data sets. Finally, it is capable of seamlessly discovering clusters embedded in subspaces, thanks to its use of a novel data representation and Chi-square dissimilarity measures. Experiments on both synthetic and real data demonstrate the superior performance of our algorithm.
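
The abstract gives no implementation details, so the following is only a minimal sketch of how an MCA-based initial bisection and a Chi-square distance might look. It assumes that objects are one-hot encoded into an indicator matrix and that the initial split follows the sign of each object's first MCA coordinate; both are assumptions for illustration, not DHCC's confirmed procedure. The names `indicator_matrix`, `mca_bisect`, and `chi_square_distance` are hypothetical, and the distance shown is the standard correspondence-analysis Chi-square distance between row profiles, which may differ from the measure used in the paper. The refinement step and the termination strategy are not sketched.

```python
import numpy as np


def indicator_matrix(rows):
    """One-hot encode categorical rows into an object-by-category 0/1 matrix."""
    columns = []
    for j in range(len(rows[0])):
        for value in sorted({row[j] for row in rows}):
            columns.append((j, value))
    Z = np.array([[1.0 if row[j] == value else 0.0 for j, value in columns]
                  for row in rows])
    return Z, columns


def mca_bisect(Z):
    """Assumed splitting rule: sign of each object's first MCA coordinate."""
    P = Z / Z.sum()                      # correspondence matrix
    r = P.sum(axis=1)                    # row (object) masses
    c = P.sum(axis=0)                    # column (category) masses
    expected = np.outer(r, c)
    S = (P - expected) / np.sqrt(expected)   # standardized residuals
    U, s, _ = np.linalg.svd(S, full_matrices=False)
    first_coordinate = s[0] * U[:, 0] / np.sqrt(r)
    return first_coordinate >= 0         # boolean mask: True = one child cluster


def chi_square_distance(Z, i, k):
    """Standard CA Chi-square distance between the row profiles of objects i and k."""
    c = Z.sum(axis=0) / Z.sum()
    profile_i = Z[i] / Z[i].sum()
    profile_k = Z[k] / Z[k].sum()
    return float(np.sqrt(np.sum((profile_i - profile_k) ** 2 / c)))


# Toy data (made up for illustration): two groups that differ on every attribute.
rows = [("a", "x", "p"), ("a", "x", "p"), ("a", "x", "q"),
        ("b", "y", "r"), ("b", "y", "r"), ("b", "y", "s")]
Z, columns = indicator_matrix(rows)
labels = mca_bisect(Z)
```

On this toy data the first three objects fall in one child cluster and the last three in the other. Which side receives `True` is arbitrary, because the sign of a singular vector is not determined.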