Determining the best K for clustering transactional datasets: A coverage density-based approach

Authors:
Hua Yan;Keke Chen;Ling Liu;Joonsoo Bae
Affiliations:
Computational Intelligence Laboratory, University of Electronic Science and Technology of China, Chengdu 610054, P.R. China;Department of Computer Science and Engineering, Wright State University, Dayton OH 45435, USA;College of Computing, Georgia Institute of Technology, Atlanta, GA 30280, USA;Department of Industrial and Information Systems Engineering, Chonbuk National University, South Korea
Venue:
Data & Knowledge Engineering
Year:
2009


Abstract

Determining the optimal number of clusters is an important but poorly understood problem in cluster analysis. In this paper, we propose a novel method for finding a set of candidate optimal numbers of clusters (candidate Ks) in transactional datasets. Concretely, we propose the Transactional-cluster-modes Dissimilarity, an intuitive inter-cluster dissimilarity measure for transactional data based on the concept of coverage density. Based on this measure, we develop an agglomerative hierarchical clustering algorithm; the Merging Dissimilarity Indexes generated during the hierarchical cluster merging process are then used to identify the candidate optimal Ks for transactional data. Our experimental results on both synthetic and real data show that the new method often estimates the number of clusters of transactional data effectively.
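
The abstract does not define its measures, so the following is only a minimal sketch of the general idea, written under stated assumptions. Coverage density is taken here to be the filled fraction of a cluster's transaction-by-item grid. The function `merge_dissimilarity` is a hypothetical stand-in (the size-weighted coverage density lost by merging two clusters), not the paper's Transactional-cluster-modes Dissimilarity, and `candidate_ks` ranks values of K by the jump in merge cost, which may differ from how the Merging Dissimilarity Indexes are actually analysed. All function names are illustrative.

```python
from itertools import combinations


def coverage_density(cluster):
    """Filled fraction of the cluster's transaction-by-item grid.

    `cluster` is a list of non-empty item sets (one per transaction).
    """
    items = set().union(*cluster)
    occurrences = sum(len(t) for t in cluster)
    return occurrences / (len(cluster) * len(items))


def merge_dissimilarity(a, b):
    """ASSUMPTION: stand-in measure, not the paper's definition.

    Size-weighted coverage density of a and b, minus the coverage
    density of the merged cluster.
    """
    merged = a + b
    before = (len(a) * coverage_density(a)
              + len(b) * coverage_density(b)) / len(merged)
    return before - coverage_density(merged)


def agglomerate(transactions):
    """Greedy bottom-up merging; returns [(k, cost of going from k to k-1)]."""
    clusters = [[set(t)] for t in transactions]
    merge_log = []
    while len(clusters) > 1:
        # Pick the pair of clusters that is cheapest to merge.
        (i, j), d = min(
            (((i, j), merge_dissimilarity(clusters[i], clusters[j]))
             for i, j in combinations(range(len(clusters)), 2)),
            key=lambda pair: pair[1],
        )
        merge_log.append((len(clusters), d))
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]  # safe because i < j
    return merge_log


def candidate_ks(merge_log, top=2):
    """Rank K by how sharply the merge cost rises once K clusters are left."""
    jumps = [
        (d - d_prev, k)
        for (_, d_prev), (k, d) in zip(merge_log, merge_log[1:])
    ]
    jumps.sort(reverse=True)
    return [k for _, k in jumps[:top]]


if __name__ == "__main__":
    # Toy data: two groups of identical transactions over disjoint items.
    data = [{"a", "b", "c"}] * 3 + [{"x", "y", "z"}] * 3
    print(candidate_ks(agglomerate(data), top=1))
```

On this toy data every within-group merge costs nothing and the final cross-group merge is the only costly one, so the top candidate is K = 2. The pairwise search is quadratic per merge and is meant for illustration only.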