A dissimilarity measure for the k-Modes clustering algorithm

Authors:
Fuyuan Cao;Jiye Liang;Deyu Li;Liang Bai;Chuangyin Dang
Affiliations:
Key Laboratory of Computational Intelligence and Chinese Information Processing of Ministry of Education, School of Computer and Information Technology, Shanxi University, Taiyuan 030006, Shanxi, ...;Key Laboratory of Computational Intelligence and Chinese Information Processing of Ministry of Education, School of Computer and Information Technology, Shanxi University, Taiyuan 030006, Shanxi, ...;Key Laboratory of Computational Intelligence and Chinese Information Processing of Ministry of Education, School of Computer and Information Technology, Shanxi University, Taiyuan 030006, Shanxi, ...;Key Laboratory of Computational Intelligence and Chinese Information Processing of Ministry of Education, School of Computer and Information Technology, Shanxi University, Taiyuan 030006, Shanxi, ...;Department of Manufacturing Engineering and Engineering Management, City University of Hong Kong, Hong Kong, China
Venue:
Knowledge-Based Systems
Year:
2012

Abstract

Clustering is one of the most important data mining techniques; it partitions data according to a similarity criterion. Clustering categorical data has recently attracted much attention from the data mining research community. As an extension of the k-Means algorithm, the k-Modes algorithm has been widely applied to categorical data clustering by replacing means with modes. In this paper, the limitations of the simple matching dissimilarity measure and Ng's dissimilarity measure are analyzed using illustrative examples. Based on the idea of biological and genetic taxonomy and the rough membership function, a new dissimilarity measure for the k-Modes algorithm is defined. A distinguishing characteristic of the new measure is that it takes into account the distribution of attribute values over the whole universe. A convergence study and a time complexity analysis of the k-Modes algorithm based on the new dissimilarity measure indicate that it can be used effectively on large data sets. The results of comparative experiments on synthetic data sets and five real data sets from UCI show the effectiveness of the new dissimilarity measure, especially on data sets with biological and genetic taxonomy information.
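
As a point of reference for the baseline the abstract discusses, the following is a minimal Python sketch of k-Modes with the simple matching dissimilarity, which counts the attributes on which two objects disagree. It does not implement the new dissimilarity measure proposed in the paper, since the abstract does not give its definition. The function names (`simple_matching`, `column_modes`, `k_modes`), the toy data, and the choice of passing initial modes explicitly are all assumptions made for illustration.

```python
from collections import Counter


def simple_matching(x, z):
    """Simple matching dissimilarity: number of attributes where x and z differ."""
    return sum(a != b for a, b in zip(x, z))


def column_modes(objects):
    """Mode of a cluster: the most frequent value of each attribute.

    Ties are broken by first occurrence (an arbitrary choice for this sketch).
    """
    return tuple(Counter(col).most_common(1)[0][0] for col in zip(*objects))


def k_modes(data, init_modes, max_iter=100):
    """Baseline k-Modes: alternate assignment and mode update until labels stop changing."""
    modes = list(init_modes)
    labels = None
    for _ in range(max_iter):
        # Assignment step: each object goes to its nearest mode.
        new_labels = [
            min(range(len(modes)), key=lambda j: simple_matching(x, modes[j]))
            for x in data
        ]
        if new_labels == labels:
            break
        labels = new_labels
        # Update step: replace each mode with the per-attribute modes of its members.
        for j in range(len(modes)):
            members = [x for x, lab in zip(data, labels) if lab == j]
            if members:  # keep the old mode if a cluster becomes empty
                modes[j] = column_modes(members)
    return labels, modes


# Hypothetical toy data: six objects described by three categorical attributes.
data = [
    ("a", "x", "p"), ("a", "x", "q"), ("a", "y", "p"),
    ("b", "z", "r"), ("b", "z", "s"), ("c", "z", "r"),
]
labels, modes = k_modes(data, init_modes=[data[0], data[3]])
print(labels, modes)
```

Under simple matching, every mismatch contributes the same amount regardless of how common or rare the attribute values are. The abstract's stated motivation for the new measure is precisely that it accounts for the distribution of attribute values over the whole universe, which this baseline ignores.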