Determining the number of clusters using information entropy for mixed data

Authors:
Jiye Liang;Xingwang Zhao;Deyu Li;Fuyuan Cao;Chuangyin Dang
Affiliations:
Key Laboratory of Computational Intelligence and Chinese Information Processing of Ministry of Education, School of Computer and Information Technology, Shanxi University, Taiyuan, 030006 Shanxi, ...;Key Laboratory of Computational Intelligence and Chinese Information Processing of Ministry of Education, School of Computer and Information Technology, Shanxi University, Taiyuan, 030006 Shanxi, ...;Key Laboratory of Computational Intelligence and Chinese Information Processing of Ministry of Education, School of Computer and Information Technology, Shanxi University, Taiyuan, 030006 Shanxi, ...;Key Laboratory of Computational Intelligence and Chinese Information Processing of Ministry of Education, School of Computer and Information Technology, Shanxi University, Taiyuan, 030006 Shanxi, ...;Department of Manufacturing Engineering and Engineering Management, City University of Hong Kong, Hong Kong
Venue:
Pattern Recognition
Year:
2012


Abstract

In cluster analysis, one of the most challenging problems is determining the number of clusters in a data set, which is a basic input parameter for most clustering algorithms. Many algorithms have been proposed to solve this problem for either numerical or categorical data sets, but they are not very effective for a mixed data set containing both numerical and categorical attributes. To overcome this deficiency, this paper presents a generalized mechanism that integrates Rényi entropy and complement entropy. The mechanism can uniformly characterize within-cluster entropy and between-cluster entropy and identify the worst cluster in a mixed data set. To evaluate clustering results for mixed data, an effective cluster validity index is also defined. Furthermore, by introducing a new dissimilarity measure into the k-prototypes algorithm, we develop an algorithm to determine the number of clusters in a mixed data set. The performance of the algorithm has been studied on several synthetic and real-world data sets. Comparisons with other clustering algorithms show that the proposed algorithm is more effective in detecting the optimal number of clusters and generates better clustering results.
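
The two entropy measures named in the abstract can be illustrated separately. The sketch below is not the paper's mechanism: the abstract does not say how the two measures are combined, so each is computed on its own here. It assumes the standard textbook forms, namely complement entropy as the sum of p_i(1 - p_i) over the values of one categorical attribute, and Rényi quadratic entropy estimated with a Gaussian Parzen window. The function names, the `sigma` bandwidth, and the toy data are hypothetical.

```python
import math
from collections import Counter


def complement_entropy(values):
    """Complement entropy of one categorical attribute: sum_i p_i * (1 - p_i).

    Assumed standard form; 0 when every object shares the same value.
    """
    n = len(values)
    counts = Counter(values)
    return sum((c / n) * (1 - c / n) for c in counts.values())


def renyi_quadratic_entropy(points, sigma=1.0):
    """Parzen-window estimate of Renyi quadratic entropy for numeric points.

    Uses -log((1/N^2) * sum_ij G(x_i - x_j; 2*sigma^2 * I)), where G is a
    Gaussian kernel. `sigma` is a hypothetical bandwidth chosen for the demo.
    """
    n = len(points)
    d = len(points[0])
    var = 2.0 * sigma ** 2  # variance of two convolved Gaussian kernels
    norm = (2.0 * math.pi * var) ** (-d / 2.0)
    total = 0.0
    for x in points:
        for y in points:
            sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
            total += norm * math.exp(-sq_dist / (2.0 * var))
    return -math.log(total / (n * n))


if __name__ == "__main__":
    # Toy cluster with one categorical attribute and two numeric attributes.
    colors = ["red", "red", "blue", "blue"]
    tight = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1)]
    spread = [(0.0, 0.0), (5.0, 0.0), (0.0, 5.0)]
    print("complement entropy:", complement_entropy(colors))
    print("Renyi entropy (tight): ", renyi_quadratic_entropy(tight))
    print("Renyi entropy (spread):", renyi_quadratic_entropy(spread))
```

Under these assumptions, a compact numeric cluster yields a lower Rényi estimate than a scattered one, and a categorical attribute with a single value yields zero complement entropy, which is the intuition behind using entropy to score within-cluster homogeneity.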