A data labeling method for clustering categorical data

Authors:
Fuyuan Cao;Jiye Liang
Affiliations:
School of Computer and Information Technology, Shanxi University, Taiyuan, 030006 Shanxi, China and Key Laboratory of Computational Intelligence and Chinese Information Processing of Ministry of E ...;School of Computer and Information Technology, Shanxi University, Taiyuan, 030006 Shanxi, China and Key Laboratory of Computational Intelligence and Chinese Information Processing of Ministry of E ...
Venue:
Expert Systems with Applications: An International Journal
Year:
2011

Abstract

As data volumes grow rapidly, clustering a very large data set inevitably becomes time-consuming. To improve clustering efficiency, sampling is commonly used to reduce the size of the data set. With sampling, however, allocating the unlabeled objects (those outside the sample) to the proper clusters is a difficult problem. This paper proposes a novel similarity measure for categorical data, based on the frequency of attribute values within a given cluster and the distribution of attribute values across clusters, that allocates each unlabeled object to the most appropriate cluster. A labeling algorithm for categorical data is then presented, and its time complexity is analyzed. Experiments on real-world data sets demonstrate the effectiveness of the proposed algorithm.
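
The abstract does not state the similarity measure itself, so the following is only a minimal sketch of the general idea and not the authors' formula. It assumes a sample has already been clustered, and it scores an unlabeled object against each cluster by combining (a) the frequency of each of its attribute values inside that cluster with (b) the share of that value's occurrences that fall in that cluster. The function names and the product weighting are illustrative assumptions.

```python
from collections import Counter, defaultdict


def build_cluster_stats(sample, labels):
    """Count attribute values per cluster from an already clustered sample.

    Returns (counts, sizes) where counts[cluster][attr_index][value] is the
    number of sample objects in `cluster` with that value, and
    sizes[cluster] is the number of sample objects in `cluster`.
    """
    counts = defaultdict(lambda: defaultdict(Counter))
    sizes = Counter(labels)
    for obj, cluster in zip(sample, labels):
        for j, value in enumerate(obj):
            counts[cluster][j][value] += 1
    return counts, sizes


def similarity(obj, cluster, counts, sizes):
    """Hypothetical similarity between an object and a cluster.

    Assumption (not taken from the paper): for each attribute, multiply the
    value's relative frequency inside the cluster by the fraction of all
    occurrences of that value that belong to the cluster, then sum.
    """
    score = 0.0
    for j, value in enumerate(obj):
        in_cluster = counts[cluster][j][value]
        if in_cluster == 0:
            continue  # value never seen in this cluster: contributes nothing
        total = sum(counts[c][j][value] for c in sizes)
        frequency = in_cluster / sizes[cluster]  # within-cluster frequency
        concentration = in_cluster / total       # distribution across clusters
        score += frequency * concentration
    return score


def label(obj, counts, sizes):
    """Assign an unlabeled object to the cluster with the highest similarity."""
    return max(sizes, key=lambda c: similarity(obj, c, counts, sizes))
```

Under these assumptions, labeling is a single pass over the unlabeled objects: build the statistics once from the clustered sample with `build_cluster_stats`, then call `label` on each remaining object. Precomputing the per-value totals across clusters would avoid recomputing them for every cluster.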