Estimating the predominant number of clusters in a dataset

Authors:
Jamil Al Shaqsi; Wenjia Wang
Affiliations:
Department of Information Systems, Sultan Qaboos University, Muscat, Oman; School of Computing Sciences, University of East Anglia, Norwich, UK
Venue:
Intelligent Data Analysis
Year:
2013

Abstract

In cluster analysis, finding the number of clusters, K, for a given dataset is an important yet difficult task, because for most real-world problems there is no universally accepted right or wrong answer: the appropriate K depends on the context and purpose of the cluster study. Numerous methods have been developed for estimating K, but most are not widely used in practice because of their poor performance. It is therefore still common for many clustering methods to require a human user to select a specific value or a range for K before they are run, and an inappropriate choice of K can lead to poor clustering results. This paper presents a new method for automatically estimating the most probable number of clusters. It first calculates the lengths, L, of the constant similarity intervals, and then takes the longest intervals to represent the most probable numbers of clusters under the set context and the chosen similarity measure. An error function is defined to measure and evaluate the goodness of the estimates. The proposed method was tested on 3 synthetic datasets and 8 real-world benchmark datasets, and compared with other popular methods, in particular the TwoStep method implemented in the IBM/SPSS Modeler software package. The experimental results show that the proposed method finds the "desired" predominant number of clusters for all the simulated datasets and most of the benchmark datasets, and statistical tests indicate that it is significantly better than the methods it was compared with.
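The abstract does not say how the constant similarity intervals are obtained, so the following Python sketch is only one possible reading of the idea, not the authors' algorithm. It assumes a single-linkage hierarchy built on Euclidean distance (a dissimilarity standing in for the paper's similarity measure), treats each K as "constant" over the range of cut thresholds between two consecutive merges, and returns the K with the longest such range. The function names `merge_heights`, `interval_lengths`, and `estimate_k` are hypothetical.

```python
import math
from itertools import combinations


def merge_heights(points):
    """Single-linkage merge distances in ascending order.

    Assumption: single linkage on Euclidean distance. The merge heights
    equal the edge weights of a minimum spanning tree (Kruskal's algorithm).
    """
    n = len(points)
    parent = list(range(n))

    def find(i):
        # Union-find root lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    edges = sorted(
        (math.dist(points[i], points[j]), i, j)
        for i, j in combinations(range(n), 2)
    )
    heights = []
    for d, i, j in edges:
        ri, rj = find(i), find(j)
        if ri != rj:          # this edge joins two separate clusters
            parent[ri] = rj
            heights.append(d)
    return heights            # n - 1 values


def interval_lengths(points):
    """Map each K in 2..n-1 to the length L of the threshold interval
    over which cutting the hierarchy yields exactly K clusters."""
    h = merge_heights(points)
    n = len(points)
    # After m merges there are n - m clusters, for thresholds in [h[m-1], h[m]).
    return {n - m: h[m] - h[m - 1] for m in range(1, n - 1)}


def estimate_k(points):
    """Return the K whose constant interval is longest (needs n >= 3)."""
    lengths = interval_lengths(points)
    return max(lengths, key=lengths.get)


# Hypothetical toy data: three well-separated groups of three points each.
demo = [(0, 0), (0, 1), (1, 0),
        (10, 10), (10, 11), (11, 10),
        (20, 0), (20, 1), (21, 0)]
print(estimate_k(demo))
```

On the toy data, the interval for K = 3 spans the thresholds between the last within-group merge and the first between-group merge, which is far longer than any other interval. The trivial cases K = 1 and K = n are excluded here because their intervals are unbounded or start at zero; whether the paper treats them the same way is not stated in the abstract.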