Information theoretic measures for clusterings comparison: is a correction for chance necessary?

  • Authors:
  • Nguyen Xuan Vinh, Julien Epps, James Bailey

  • Affiliations:
  • Nguyen Xuan Vinh: The University of New South Wales, Sydney, Australia & ATP Laboratory, National ICT Australia (NICTA)
  • Julien Epps: The University of New South Wales, Sydney, Australia & ATP Laboratory, National ICT Australia (NICTA)
  • James Bailey: The University of Melbourne, Australia & Victoria Research Laboratory, National ICT Australia

  • Venue:
  • ICML '09 Proceedings of the 26th Annual International Conference on Machine Learning
  • Year:
  • 2009

Abstract

Information-theoretic measures form a fundamental class of similarity measures for comparing clusterings, alongside the classes of pair-counting and set-matching measures. In this paper, we discuss whether a correction for chance is necessary for information-theoretic clustering-comparison measures. We observe that the baseline for such measures, i.e. their average value between random partitions of a data set, does not take a constant value, and tends to vary more when the ratio of the number of data points to the number of clusters is small. A similar effect occurs in some non-information-theoretic measures, such as the well-known Rand Index. Assuming a hypergeometric model of randomness, we derive an analytical formula for the expected mutual information between a pair of clusterings, and then propose adjusted versions of several popular information-theoretic measures. Examples are given to demonstrate the need for, and usefulness of, the adjusted measures.
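To illustrate the kind of adjustment the abstract describes, the following is a minimal sketch of an adjusted mutual information computation: it builds the contingency table of two labelings, computes the observed mutual information and entropies, evaluates the expected mutual information under the hypergeometric model of randomness, and then normalizes. The function name `ami` and the choice of max-entropy normalization are assumptions for this sketch, not necessarily the exact form in the paper; a production implementation such as scikit-learn's `adjusted_mutual_info_score` should be preferred in practice.

```python
from math import log, lgamma, exp

def ami(labels_a, labels_b):
    """Sketch of Adjusted Mutual Information between two labelings,
    using the expected MI under a hypergeometric model of randomness.
    Assumes both labelings have at least two clusters."""
    n = len(labels_a)
    rows, cols = sorted(set(labels_a)), sorted(set(labels_b))
    # Contingency table: cont[(r, c)] = number of points in cluster r of A and c of B.
    cont = {(r, c): 0 for r in rows for c in cols}
    for x, y in zip(labels_a, labels_b):
        cont[(x, y)] += 1
    a = {r: sum(cont[(r, c)] for c in cols) for r in rows}  # row marginals
    b = {c: sum(cont[(r, c)] for r in rows) for c in cols}  # column marginals
    # Entropies and observed mutual information (natural log).
    h_a = -sum(a[r] / n * log(a[r] / n) for r in rows)
    h_b = -sum(b[c] / n * log(b[c] / n) for c in cols)
    mi = sum(cont[(r, c)] / n * log(n * cont[(r, c)] / (a[r] * b[c]))
             for r in rows for c in cols if cont[(r, c)] > 0)
    # Expected MI: sum over all feasible cell counts n_ij, weighting each
    # MI contribution by its hypergeometric probability (via log-factorials).
    emi = 0.0
    for r in rows:
        for c in cols:
            for nij in range(max(1, a[r] + b[c] - n), min(a[r], b[c]) + 1):
                log_prob = (lgamma(a[r] + 1) + lgamma(b[c] + 1)
                            + lgamma(n - a[r] + 1) + lgamma(n - b[c] + 1)
                            - lgamma(n + 1) - lgamma(nij + 1)
                            - lgamma(a[r] - nij + 1) - lgamma(b[c] - nij + 1)
                            - lgamma(n - a[r] - b[c] + nij + 1))
                emi += nij / n * log(n * nij / (a[r] * b[c])) * exp(log_prob)
    # Adjusted measure: 1 for identical clusterings, ~0 on average for random ones.
    return (mi - emi) / (max(h_a, h_b) - emi)
```

Identical clusterings score 1, while independent ones score near (or below) zero, which is exactly the baseline behaviour the adjustment is meant to guarantee.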