An iterative initial-points refinement algorithm for categorical data clustering

Authors:
Ying Sun;Qiuming Zhu;Zhengxin Chen
Affiliations:
Department of Computer Science, Digital Imaging and Computer Vision Laboratory, University of Nebraska at Omaha, Omaha;Department of Computer Science, Digital Imaging and Computer Vision Laboratory, University of Nebraska at Omaha, Omaha;Department of Computer Science, Digital Imaging and Computer Vision Laboratory, University of Nebraska at Omaha, Omaha
Venue:
Pattern Recognition Letters
Year:
2002

References (10):
Algorithms for clustering data
Symbolic clustering using a new dissimilarity measure. Pattern Recognition
A conceptual version of the K-means algorithm. Pattern Recognition Letters
Clustering Algorithms
Machine Learning and Data Mining; Methods and Applications
Extensions to the k-Means Algorithm for Clustering Large Data Sets with Categorical Values. Data Mining and Knowledge Discovery
Experiments with Incremental Concept Formation: UNIMEM. Machine Learning
Knowledge Acquisition Via Incremental Conceptual Clustering. Machine Learning
Refining Initial Points for K-Means Clustering. ICML '98 Proceedings of the Fifteenth International Conference on Machine Learning
Efficient and Effective Clustering Methods for Spatial Data Mining. VLDB '94 Proceedings of the 20th International Conference on Very Large Data Bases

Cited by (9):
TCSOM: Clustering Transactions Using Self-Organizing Map. Neural Processing Letters
Performing clustering analysis on collaborative models. Intelligent Data Analysis
k-ANMI: A mutual information based clustering algorithm for categorical data. Information Fusion
A new initialization method for categorical data clustering. Expert Systems with Applications: An International Journal
Computation of initial modes for K-modes clustering algorithm using evidence accumulation. IJCAI'07 Proceedings of the 20th international joint conference on Artificial intelligence
A new initialization method for clustering categorical data. PAKDD'07 Proceedings of the 11th Pacific-Asia conference on Advances in knowledge discovery and data mining
An initialization method to simultaneously find initial cluster centers and the number of clusters for clustering categorical data. Knowledge-Based Systems
A cluster centers initialization method for clustering categorical data. Expert Systems with Applications: An International Journal
Attribute value weighting in k-modes clustering. Expert Systems with Applications: An International Journal

Abstract

The original k-means clustering algorithm is designed to work primarily on numeric data sets, which prevents it from being applied directly to categorical data clustering in many data mining applications. The k-modes algorithm [Z. Huang, Clustering large data sets with mixed numeric and categorical values, in: Proceedings of the First Pacific Asia Knowledge Discovery and Data Mining Conference, World Scientific, Singapore, 1997, pp. 21-34] extends the k-means paradigm to categorical data by using a frequency-based method to update the cluster modes, rather than the k-means approach of minimizing a numerically valued cost. However, as with most data clustering algorithms, k-modes requires the initial points (modes) of the clusters to be preset or randomly selected, and differences in the initial points often lead to considerably different clustering results. In this paper we present an experimental study on applying Bradley and Fayyad's iterative initial-point refinement algorithm to k-modes clustering to improve the accuracy and repeatability of the clustering results [cf. P. Bradley, U. Fayyad, Refining initial points for k-means clustering, in: Proceedings of the 15th International Conference on Machine Learning, Morgan Kaufmann, Los Altos, CA, 1998]. Experiments show that the k-modes clustering algorithm using refined initial points leads to higher-precision results much more reliably than random selection without refinement, thus making the refinement process applicable to many data mining applications with categorical data.
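
The idea summarized in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names (`mismatch`, `k_modes`, `cost`, `refine_initial_modes`), the parameters, and the tie-breaking rules are all assumptions made for illustration. The sketch assumes the simple matching dissimilarity (the count of differing attributes), keeps the old mode when a cluster becomes empty, and follows the general subsample-then-cluster-the-solutions shape of Bradley and Fayyad's refinement without their special handling of empty clusters.

```python
import random
from collections import Counter


def mismatch(a, b):
    """Simple matching dissimilarity: number of attributes that differ."""
    return sum(x != y for x, y in zip(a, b))


def k_modes(data, modes, max_iter=20):
    """Run k-modes from the given initial modes; return (modes, labels)."""
    modes = [tuple(m) for m in modes]
    labels = []
    for _ in range(max_iter):
        # Assignment step: each record goes to its nearest mode
        # (ties go to the lowest cluster index).
        labels = [min(range(len(modes)), key=lambda j: mismatch(x, modes[j]))
                  for x in data]
        new_modes = []
        for j, old in enumerate(modes):
            members = [x for x, lab in zip(data, labels) if lab == j]
            if not members:
                new_modes.append(old)  # assumption: empty cluster keeps its mode
                continue
            # Frequency-based update: most common value of each attribute.
            new_modes.append(tuple(Counter(col).most_common(1)[0][0]
                                   for col in zip(*members)))
        if new_modes == modes:
            break
        modes = new_modes
    return modes, labels


def cost(data, modes):
    """Total dissimilarity of every record to its nearest mode."""
    return sum(min(mismatch(x, m) for m in modes) for x in data)


def refine_initial_modes(data, k, n_subsamples=10, sample_size=None, seed=0):
    """Sketch of subsample-based refinement of the initial modes."""
    rng = random.Random(seed)
    sample_size = sample_size or max(k, len(data) // n_subsamples)
    # Step 1: cluster each small random subsample from random starting records.
    candidates = []
    for _ in range(n_subsamples):
        sample = rng.sample(data, sample_size)
        start = rng.sample(sample, k)
        candidates.append(k_modes(sample, start)[0])
    # Step 2: cluster the pooled candidate modes, starting from each
    # subsample solution in turn, and keep the lowest-cost result.
    pool = [m for sol in candidates for m in sol]
    return min((k_modes(pool, sol)[0] for sol in candidates),
               key=lambda ms: cost(pool, ms))
```

A typical use would be `modes, labels = k_modes(data, refine_initial_modes(data, k))`, where `data` is a list of equal-length tuples of categorical values. Fixing `seed` makes the refinement repeatable, which is the property the study is concerned with.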