A cluster operator takes a set of data points and partitions the points into clusters (subsets). As with any scientific model, the scientific content of a cluster operator lies in its ability to predict results, and this ability is measured by its error rate relative to cluster formation. To estimate the error of a cluster operator, a sample of point sets is generated, the algorithm is applied to each point set, the resulting clusters are evaluated against the partition known from the generating distributions, and the errors are averaged over the point sets in the sample. Many validity measures have been proposed for evaluating clustering results based on a single realization of the random-point-set process. In this paper we consider a number of proposed validity measures and examine how well they correlate with error rates across a number of clustering algorithms and random-point-set models. Validity measures fall broadly into three classes: internal validation, based on computing properties of the resulting clusters; relative validation, based on comparing partitions generated by the same algorithm with different parameters or on different subsets of the data; and external validation, which compares the partition generated by the clustering algorithm with a given reference partition of the data. To quantify the agreement between the validation indices and the clustering errors, we use Kendall's rank correlation between their values. Our results indicate that, overall, the performance of validity indices is highly variable. For complex models, or when a clustering algorithm yields complex clusters, both the internal and relative indices fail to predict the error of the algorithm. Some external indices appear to perform well, whereas others do not.
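The evaluation protocol described above can be sketched in a few lines of Python. This is an illustrative sketch, not the paper's implementation: `clustering_error` estimates the error of a labeling against the known partition by taking the best matching of cluster labels (practical only for small numbers of clusters, since it enumerates label permutations), and `kendall_tau` is a minimal Kendall rank correlation between a sequence of validity-index values and the corresponding error rates. All function names are hypothetical.

```python
# Illustrative sketch (not the paper's code): clustering error under the best
# label matching, and Kendall's rank correlation between index and error.
import itertools

def clustering_error(true_labels, pred_labels, k):
    """Misclassification rate under the best permutation of cluster labels.

    Enumerates all k! label matchings, so this is only feasible for small k.
    """
    n = len(true_labels)
    best = n
    for perm in itertools.permutations(range(k)):
        errors = sum(1 for t, p in zip(true_labels, pred_labels) if t != perm[p])
        best = min(best, errors)
    return best / n

def kendall_tau(xs, ys):
    """Kendall rank correlation between two paired score sequences."""
    concordant = discordant = 0
    for (x1, y1), (x2, y2) in itertools.combinations(zip(xs, ys), 2):
        s = (x1 - x2) * (y1 - y2)
        if s > 0:
            concordant += 1
        elif s < 0:
            discordant += 1
    pairs = concordant + discordant
    return (concordant - discordant) / pairs if pairs else 0.0
```

For example, if a validity index assigns higher scores to settings with lower estimated error, the index and the error sequence are perfectly anti-correlated: with `errors = [0.05, 0.10, 0.20, 0.30]` and `index = [0.9, 0.8, 0.6, 0.4]`, `kendall_tau(errors, index)` returns `-1.0`. An index that tracks the error well yields correlation near ±1; an uninformative index yields values near 0.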
We conclude that one should not put much faith in a validity score unless there is evidence, either in terms of sufficient data for model estimation or prior model knowledge, that a validity measure is well-correlated to the error rate of the clustering algorithm.