How Many Clusters? An Information-Theoretic Perspective

Authors: Susanne Still; William Bialek
Affiliation (both authors): Department of Physics and Lewis-Sigler Institute for Integrative Genomics, Princeton University, Princeton, NJ 08544, U.S.A.
Venue: Neural Computation
Year: 2004

Abstract

Clustering provides a common means of identifying structure in complex data, and there is renewed interest in clustering as a tool for the analysis of large data sets in many fields. A natural question is how many clusters are appropriate for the description of a given system. Traditional approaches to this problem are based either on a framework in which clusters of a particular shape are assumed as a model of the system, or on a two-step procedure in which a clustering criterion determines the optimal assignments for a given number of clusters and a separate criterion measures the goodness of the classification to determine the number of clusters. In a statistical mechanics approach, clustering can be seen as a trade-off between energy- and entropy-like terms, with lower temperature driving the proliferation of clusters to provide a more detailed description of the data. For finite data sets, we expect that there is a limit to the meaningful structure that can be resolved, and therefore a minimum temperature below which we will capture sampling noise. This suggests that correcting the clustering criterion for the bias that arises due to sampling errors will allow us to find a clustering solution at a temperature that is optimal in the sense that we capture maximal meaningful structure, without having to define an external criterion for the goodness or stability of the clustering. We show that in a general information-theoretic framework, the finite size of a data set determines an optimal temperature, and we introduce a method for finding the maximal number of clusters that can be resolved from the data in the hard clustering limit.
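
The sampling bias mentioned in the abstract can be illustrated with a small numerical sketch. The code below is not the authors' method. It only shows, under assumptions chosen for illustration, that a plug-in estimate of mutual information computed from a finite sample is biased upward: two independent uniform variables have zero mutual information, yet the estimate is positive on average and shrinks as the sample grows. The function names, alphabet sizes `kx` and `ky`, and trial counts are all hypothetical choices. The comparison value is the standard first-order bias term (Kx - 1)(Ky - 1) / (2 N ln 2) bits, stated here as a known approximation rather than as the correction used in the paper.

```python
import math
import random
from collections import Counter


def plugin_mutual_information(pairs):
    """Plug-in (empirical-frequency) estimate of I(X;Y) in bits from (x, y) samples."""
    n = len(pairs)
    joint = Counter(pairs)
    px = Counter(x for x, _ in pairs)
    py = Counter(y for _, y in pairs)
    mi = 0.0
    for (x, y), c in joint.items():
        # p(x,y) * log2(p(x,y) / (p(x) p(y))), written in terms of raw counts
        mi += (c / n) * math.log2(c * n / (px[x] * py[y]))
    return mi


def mean_plugin_mi_independent(n_samples, kx=4, ky=4, n_trials=200, seed=0):
    """Average plug-in estimate over trials where X and Y are independent and uniform.

    The true mutual information is exactly zero, so the returned mean is pure bias.
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_trials):
        pairs = [(rng.randrange(kx), rng.randrange(ky)) for _ in range(n_samples)]
        total += plugin_mutual_information(pairs)
    return total / n_trials


def first_order_bias(n_samples, kx=4, ky=4):
    """Leading-order upward bias of the plug-in estimate, in bits."""
    return (kx - 1) * (ky - 1) / (2 * n_samples * math.log(2))


if __name__ == "__main__":
    for n in (50, 200, 1000):
        print(n, mean_plugin_mi_independent(n), first_order_bias(n))
```

Running the script prints, for each sample size, the average spurious information alongside the first-order approximation. In a clustering setting this kind of bias means that adding clusters can appear to capture more information even when the extra structure is only sampling noise, which is the motivation the abstract gives for correcting the criterion.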