“Best K”: critical clustering structures in categorical datasets

Authors:
Keke Chen; Ling Liu
Affiliations:
Wright State University, Department of Computer Science and Engineering, Dayton, OH, USA; Georgia Institute of Technology, College of Computing, Atlanta, GA, USA
Venue:
Knowledge and Information Systems
Year:
2009

Abstract

The demand for cluster analysis of categorical data has continued to grow over the last decade. A well-known problem in categorical clustering is determining the best number of clusters, K. Although several categorical clustering algorithms have been developed, surprisingly, none has satisfactorily addressed the best-K problem for categorical clustering. Because categorical data has no inherent distance function to serve as a similarity measure, traditional cluster validation techniques based on geometric shapes and density distributions are not appropriate for it. In this paper, we study the entropy property between clustering results of categorical data with different numbers of clusters K, and propose the BKPlot method to address three important cluster validation problems: (1) How can we determine whether there is significant clustering structure in a categorical dataset? (2) If there is significant clustering structure, what is the set of candidate “best Ks”? (3) If the dataset is large, how can we efficiently and reliably determine the best Ks?
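
To make the idea of comparing entropy across clusterings with different K more concrete, the following is a minimal illustrative sketch, not the BKPlot method itself, whose exact definition is not given in this abstract. It assumes the commonly used expected-entropy criterion for categorical partitions (the size-weighted sum of per-attribute entropies within each cluster) and then takes a plain second-order difference of that curve over K to highlight where the entropy drop levels off. The function names `column_entropy`, `expected_entropy`, and `second_differences` are hypothetical, and the cluster labels are assumed to come from any categorical clustering algorithm.

```python
from collections import Counter
from math import log2


def column_entropy(values):
    """Shannon entropy (in bits) of one categorical attribute."""
    n = len(values)
    counts = Counter(values)
    return -sum((c / n) * log2(c / n) for c in counts.values())


def expected_entropy(records, labels):
    """Size-weighted sum of per-attribute entropies within each cluster.

    records: list of equal-length tuples of categorical values.
    labels:  cluster label for each record (assumed to be given).
    """
    n = len(records)
    clusters = {}
    for rec, lab in zip(records, labels):
        clusters.setdefault(lab, []).append(rec)
    total = 0.0
    for members in clusters.values():
        # zip(*members) iterates over the attributes (columns) of the cluster.
        cluster_h = sum(column_entropy(col) for col in zip(*members))
        total += (len(members) / n) * cluster_h
    return total


def second_differences(entropy_by_k):
    """Second-order difference E(K-1) - 2*E(K) + E(K+1) for interior K.

    This is an illustrative knee indicator (an assumption of this sketch),
    not the paper's definition of the BKPlot.
    """
    result = {}
    for k in sorted(entropy_by_k):
        if k - 1 in entropy_by_k and k + 1 in entropy_by_k:
            result[k] = (entropy_by_k[k - 1]
                         - 2 * entropy_by_k[k]
                         + entropy_by_k[k + 1])
    return result
```

Under these assumptions, a K with a large positive second difference is one where moving from K-1 to K clusters reduces the entropy sharply while moving from K to K+1 reduces it much less, which is one simple way to shortlist candidate values of K.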