Clustering Large Categorical Data

Authors:
François-Xavier Jollois; Mohamed Nadif
Venue:
PAKDD '02 Proceedings of the 6th Pacific-Asia Conference on Advances in Knowledge Discovery and Data Mining
Year:
2002

Abstract

Clustering methods often reduce to optimizing a numeric criterion defined from a distance or a dissimilarity measure. It can be shown that this problem is often equivalent to estimating the parameters of a probabilistic model under the classification likelihood approach. For instance, the inertia criterion optimized by the k-means algorithm corresponds to the hypothesis of a population arising from a Gaussian mixture. In this paper, we propose a mixture model adapted to categorical data. Using the classification likelihood approach, we develop the Classification EM algorithm (CEM) to estimate the parameters of the mixture model. With our probabilistic model, the data are not distorted, and the estimated parameters readily indicate the characteristics of the clusters. This probabilistic approach gives an interpretation of the criterion optimized by the k-modes algorithm, which is an extension of k-means to categorical attributes, and allows us to study the behavior of this algorithm.
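
To make the k-modes criterion mentioned in the abstract concrete, here is a minimal Python sketch of the standard k-modes iteration: assign each record to the nearest mode under simple matching (Hamming) dissimilarity, then recompute each mode as the per-attribute majority category. It is not the authors' CEM implementation. The names `hamming`, `k_modes`, and `init_modes`, the tie-breaking rule, and the toy data are all assumptions made for illustration.

```python
from collections import Counter


def hamming(a, b):
    """Number of attributes on which two categorical records differ."""
    return sum(x != y for x, y in zip(a, b))


def k_modes(rows, init_modes, max_iter=100):
    """Illustrative k-modes loop (hypothetical helper, not from the paper).

    rows       : list of equal-length tuples of categorical values
    init_modes : list of initial cluster modes (tuples of the same length)
    Returns (labels, modes, cost), where cost is the total number of
    attribute mismatches between each record and its cluster mode.
    """
    modes = [tuple(m) for m in init_modes]
    labels = None
    for _ in range(max_iter):
        # Assignment step: nearest mode; ties go to the lowest cluster index.
        new_labels = [
            min(range(len(modes)), key=lambda k: hamming(r, modes[k]))
            for r in rows
        ]
        if new_labels == labels:
            break  # partition is stable, so the criterion cannot decrease further
        labels = new_labels
        # Update step: per-attribute majority category within each cluster.
        for k in range(len(modes)):
            members = [r for r, z in zip(rows, labels) if z == k]
            if members:  # an empty cluster keeps its previous mode
                modes[k] = tuple(
                    Counter(col).most_common(1)[0][0] for col in zip(*members)
                )
    cost = sum(hamming(r, modes[z]) for r, z in zip(rows, labels))
    return labels, modes, cost


# Toy data, invented for this example only.
rows = [
    ("red", "small", "round"),
    ("red", "small", "square"),
    ("red", "large", "round"),
    ("blue", "large", "square"),
    ("blue", "large", "round"),
    ("blue", "small", "square"),
]
labels, modes, cost = k_modes(rows, init_modes=[rows[0], rows[3]])
print(labels, modes, cost)
```

The assignment and update steps mirror the classification and maximization steps of a CEM-style iteration, which is the connection the abstract draws. The sketch fixes the initial modes by hand; in practice the result depends on initialization, so several restarts are usually compared on the final cost.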