External validation measures for K-means clustering: A data distribution perspective

Authors:
Junjie Wu;Jian Chen;Hui Xiong;Ming Xie
Affiliations:
School of Economics and Management, Beihang University, Beijing 100083, China;Research Center for Contemporary Management, Key Research Institute of Humanities and Social Sciences at Universities, Tsinghua University, Beijing 100084, China;Management Science and Information Systems Department, Rutgers University, Newark 07102, NJ, USA;Research Center for Contemporary Management, Key Research Institute of Humanities and Social Sciences at Universities, Tsinghua University, Beijing 100084, China
Venue:
Expert Systems with Applications: An International Journal
Year:
2009

Abstract

Cluster validation is an important part of any cluster analysis. External measures such as entropy, purity and mutual information are often used to evaluate K-means clustering. However, whether these measures are indeed suitable for K-means clustering remains unknown. In this paper, we show that a data distribution view is of great use in selecting the right measures for K-means clustering. Specifically, we first introduce the data distribution view of K-means and the resulting uniform effect on highly imbalanced data sets. We also collect eight external measures widely used in recent data mining tasks as candidates for K-means evaluation. We then demonstrate that only three measures, namely the variation of information (VI), the van Dongen criterion (VD) and the Mirkin metric (M), can detect the negative uniform effect of K-means in the clustering results. We also provide new normalization schemes for these three measures, i.e., $VI'_{norm}$, $VD'_{norm}$ and $M'_{norm}$, which enable cross-data comparisons of clustering quality. Finally, we explore some properties of the three measures, such as consistency and sensitivity, and give some advice on how to use them in K-means practice.
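As a rough illustration of the three measures named in the abstract, the following minimal sketch computes VI, VD and M from the class-by-cluster contingency counts. It uses the commonly cited unnormalized textbook forms of these measures, which is an assumption here: the abstract does not give the paper's exact formulas, and its normalized variants are not reproduced. The function name `external_measures` and the toy labels are hypothetical.

```python
from collections import Counter
from math import log2


def external_measures(classes, clusters):
    """Return (VI, VD, M) for true class labels vs. cluster labels.

    Assumed (unnormalized) definitions, with n_ij the contingency counts:
      VI = H(C|K) + H(K|C)                      (variation of information)
      VD = 2n - sum_i max_j n_ij - sum_j max_i n_ij   (van Dongen criterion)
      M  = sum_i n_i.^2 + sum_j n_.j^2 - 2 * sum_ij n_ij^2   (Mirkin metric)
    All three are 0 when the clustering matches the classes exactly.
    """
    n = len(classes)
    joint = Counter(zip(classes, clusters))  # n_ij
    row = Counter(classes)                   # class sizes n_i.
    col = Counter(clusters)                  # cluster sizes n_.j

    # Variation of information as the sum of two conditional entropies.
    vi = 0.0
    for (c, k), nij in joint.items():
        vi -= (nij / n) * (log2(nij / row[c]) + log2(nij / col[k]))

    # Largest overlap for each class and for each cluster.
    row_max = Counter()
    col_max = Counter()
    for (c, k), nij in joint.items():
        row_max[c] = max(row_max[c], nij)
        col_max[k] = max(col_max[k], nij)
    vd = 2 * n - sum(row_max.values()) - sum(col_max.values())

    # Mirkin metric from squared marginal and joint counts.
    m = (sum(v * v for v in row.values())
         + sum(v * v for v in col.values())
         - 2 * sum(v * v for v in joint.values()))
    return vi, vd, m


# Toy imbalanced data: 8 objects of class "a", 2 of class "b".
classes = ["a"] * 8 + ["b"] * 2
perfect = [0] * 8 + [1] * 2   # clusters match the classes
uniform = [0] * 5 + [1] * 5   # two equal-sized clusters; class "a" is split

perfect_scores = external_measures(classes, perfect)
uniform_scores = external_measures(classes, uniform)
```

In this toy case the balanced partition splits the large class across both clusters, which is the kind of result the uniform effect refers to; all three values rise above zero for it, while the matching partition scores zero on each.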