Information-theoretic measures form a fundamental class of measures for comparing clusterings and have recently received increasing interest. Nevertheless, a number of questions concerning their properties and inter-relationships remain unresolved. In this paper, we conduct an organized study of information-theoretic measures for clustering comparison, covering several popular existing measures as well as some newly proposed ones. We discuss and prove their important properties, such as the metric property and the normalization property. We then highlight to the clustering community the importance of correcting information-theoretic measures for chance, especially when the data size is small compared with the number of clusters present. Among the available information-theoretic measures, we advocate the normalized information distance (NID) as the general measure of choice, since it simultaneously possesses several important properties: it is both a metric and a normalized measure, it admits an exact analytical adjusted-for-chance form, and it uses the nominal [0,1] range better than other normalized variants.
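As an illustration of the advocated measure, the NID of two clusterings can be computed directly from their empirical label distributions via the standard definition NID(U, V) = 1 − I(U; V) / max(H(U), H(V)). The sketch below is a minimal plain-Python implementation, assuming clusterings are given as equal-length lists of cluster labels (the function name `nid` and this input format are our illustrative choices, not notation from the paper):

```python
import math
from collections import Counter

def nid(labels_u, labels_v):
    """Normalized information distance between two clusterings.

    Computes NID(U, V) = 1 - I(U; V) / max(H(U), H(V)) from the
    empirical joint distribution of cluster labels (natural log).
    """
    n = len(labels_u)
    if n == 0 or n != len(labels_v):
        raise ValueError("clusterings must be non-empty and of equal length")

    count_u = Counter(labels_u)              # marginal counts for U
    count_v = Counter(labels_v)              # marginal counts for V
    count_uv = Counter(zip(labels_u, labels_v))  # joint (contingency) counts

    # Marginal entropies H(U), H(V)
    h_u = -sum(c / n * math.log(c / n) for c in count_u.values())
    h_v = -sum(c / n * math.log(c / n) for c in count_v.values())

    # Mutual information I(U; V) over non-zero joint cells
    mi = sum(c / n * math.log((c / n) / ((count_u[u] / n) * (count_v[v] / n)))
             for (u, v), c in count_uv.items())

    h_max = max(h_u, h_v)
    return 0.0 if h_max == 0 else 1 - mi / h_max

# Identical clusterings (up to relabeling) have distance ~0;
# statistically independent ones have distance 1.
print(nid([0, 0, 1, 1], [1, 1, 0, 0]))  # ~0.0: relabeling does not matter
print(nid([0, 0, 1, 1], [0, 1, 0, 1]))  # 1.0: independent clusterings
```

Note that this plain form is not corrected for chance; as the abstract stresses, when the number of points is small relative to the number of clusters, the adjusted-for-chance variant should be preferred.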