A new initialization method for categorical data clustering

Authors:
Fuyuan Cao;Jiye Liang;Liang Bai
Affiliations:
Key Laboratory of Computational Intelligence and Chinese Information Processing of Ministry of Education, Taiyuan 030006, China and School of Computer and Information Technology, Shanxi University ...;Key Laboratory of Computational Intelligence and Chinese Information Processing of Ministry of Education, Taiyuan 030006, China and School of Computer and Information Technology, Shanxi University ...;School of Computer and Information Technology, Shanxi University, Taiyuan 030006, Shanxi, China
Venue:
Expert Systems with Applications: An International Journal
Year:
2009

Abstract

In clustering algorithms, choosing a subset of representative examples from the data set is very important. Such "exemplars" can be found by randomly choosing an initial subset of data objects and then iteratively refining it, but this works well only if the initial choice is close to a good solution. In this paper, the average density of an object is defined based on the frequency of its attribute values. Furthermore, a novel initialization method for categorical data is proposed, in which both the distance between objects and the density of each object are considered. We also apply the proposed initialization method to the k-modes algorithm and the fuzzy k-modes algorithm. Experimental results illustrate that the proposed initialization method is superior to the random initialization method and can be applied to large data sets because its time complexity is linear in the number of data objects.
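
The idea of combining density and distance can be illustrated with a minimal Python sketch. The abstract does not give the exact formulas, so everything below is an assumption made for illustration: the average density of an object is taken as the mean relative frequency of its attribute values, distance is the simple matching (Hamming) distance, and each new center is the object that maximizes its density times its distance to the nearest center already chosen. The function names `average_density`, `matching_distance`, and `init_centers` are hypothetical and do not come from the paper.

```python
from collections import Counter


def average_density(data):
    """Assumed definition: mean relative frequency of an object's attribute values."""
    n = len(data)
    m = len(data[0])
    # Count how often each value occurs in each attribute (column).
    counts = [Counter(row[j] for row in data) for j in range(m)]
    return [sum(counts[j][row[j]] for j in range(m)) / (m * n) for row in data]


def matching_distance(x, y):
    """Simple matching distance: number of attributes on which x and y differ."""
    return sum(a != b for a, b in zip(x, y))


def init_centers(data, k):
    """Pick k initial centers by trading off density against distance (a sketch)."""
    dens = average_density(data)
    # Assumption: the first center is the object with the highest average density.
    chosen = [max(range(len(data)), key=lambda i: dens[i])]
    while len(chosen) < k:
        def score(i):
            # Distance to the nearest chosen center, weighted by the object's density.
            nearest = min(matching_distance(data[i], data[c]) for c in chosen)
            return nearest * dens[i]

        candidates = [i for i in range(len(data)) if i not in chosen]
        chosen.append(max(candidates, key=score))
    return [data[i] for i in chosen]


data = [("a", "x"), ("a", "x"), ("a", "y"), ("b", "z"), ("b", "z")]
centers = init_centers(data, 2)
```

For a fixed number of clusters and attributes, this sketch makes a constant number of passes over the data, which is consistent with the linear-time claim in the abstract. The resulting centers could then be passed as the initial modes to a k-modes or fuzzy k-modes implementation.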