RSQRT: An heuristic for estimating the number of clusters to report

Authors:
John Carlis; Kelsey Bruso
Affiliations:
Computer Science and Engineering, University of Minnesota, 4-192 Keller Hall, 200 Union St. SE, Minneapolis, MN 55455, USA; Unisys Corporation, 2470 Highcrest Rd, Roseville, MN 55113, USA
Venue:
Electronic Commerce Research and Applications
Year:
2012

Abstract

Clustering can be a valuable tool for analyzing large data sets, such as those arising in e-commerce applications. Anyone who clusters must choose how many item clusters, K, to report. Unfortunately, one must guess at K or at some related parameter. Elsewhere we introduced a strongly supported heuristic, RSQRT, which predicts K as a function of the attribute count or the item count, depending on the attribute scales. We conducted a second analysis seeking confirmation of the heuristic, using data sets from the UCI machine learning benchmark repository. For the 25 studies where sufficient detail was available, we again found strong support. Also, in a side-by-side comparison of 28 studies, the K best predicted by RSQRT and the K predicted by the Bayesian information criterion (BIC) are the same. RSQRT has a lower cost, O(log log n) versus O(n^2) for BIC, and is more widely applicable. Using RSQRT prospectively could be much better than merely guessing.