K-means clustering versus validation measures: a data distribution perspective

Authors:
Hui Xiong;Junjie Wu;Jian Chen
Affiliations:
Rutgers University;Tsinghua University;Tsinghua University
Venue:
Proceedings of the 12th ACM SIGKDD international conference on Knowledge discovery and data mining
Year:
2006

Abstract

K-means is a widely used partitional clustering method. While considerable research effort has gone into characterizing the key features of K-means clustering, further investigation is needed to reveal whether and how data distributions affect its performance. In this paper, we revisit the K-means clustering problem by answering three questions. First, how do the "true" cluster sizes affect the performance of K-means clustering? Second, is entropy an algorithm-independent validation measure for K-means clustering? Finally, what is the distribution of the clustering results produced by K-means? To that end, we first illustrate that K-means tends to generate clusters with relatively uniform sizes. In addition, we show that the entropy measure, an external clustering validation measure, favors clustering algorithms that tend to reduce high variation in cluster sizes. Finally, our experimental results indicate that K-means tends to produce clusters in which the variation of the cluster sizes, as measured by the Coefficient of Variation (CV), falls in a specific range, approximately from 0.3 to 1.0.
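
The two quantities the abstract relies on, the CV of cluster sizes and the entropy of a clustering against known classes, can be sketched as below. This is a minimal illustration and not the authors' code: the function names are hypothetical, the CV is assumed to be the sample standard deviation of the cluster sizes divided by their mean, and the entropy is assumed to be the size-weighted average of per-cluster class entropy using the natural logarithm. The abstract itself does not spell out either formula.

```python
import math
from collections import Counter
from statistics import mean, stdev


def cluster_size_cv(labels):
    """Coefficient of variation of cluster sizes.

    Assumption: CV = sample standard deviation / mean of the cluster sizes.
    """
    sizes = list(Counter(labels).values())
    if len(sizes) < 2:
        raise ValueError("need at least two clusters")
    return stdev(sizes) / mean(sizes)


def clustering_entropy(true_labels, cluster_labels):
    """Size-weighted average of per-cluster class entropy (lower is purer).

    Assumption: natural logarithm, no normalization by the number of classes.
    """
    if len(true_labels) != len(cluster_labels):
        raise ValueError("label sequences must have the same length")
    n = len(true_labels)

    # Count how many objects of each true class fall into each cluster.
    class_counts = {}
    for true, cluster in zip(true_labels, cluster_labels):
        class_counts.setdefault(cluster, Counter())[true] += 1

    total = 0.0
    for counts in class_counts.values():
        size = sum(counts.values())
        h = -sum((k / size) * math.log(k / size) for k in counts.values())
        total += (size / n) * h
    return total


if __name__ == "__main__":
    # Hypothetical data: three highly imbalanced "true" classes.
    true = ["a"] * 90 + ["b"] * 8 + ["c"] * 2
    # Hypothetical clustering with much more uniform cluster sizes.
    found = [0] * 40 + [1] * 35 + [2] * 25
    print("CV of true class sizes:", cluster_size_cv(true))
    print("CV of cluster sizes:   ", cluster_size_cv(found))
    print("Entropy of clustering: ", clustering_entropy(true, found))
```

Comparing the CV of the true class sizes with the CV of the cluster sizes is one way to observe the uniformizing tendency the abstract describes. The entropy value alone does not reveal that shift, which is why the two are reported side by side here.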