A statistical model of cluster stability

Authors:
Z. Volkovich;Z. Barzily;L. Morozensky
Affiliations:
Software Engineering Department, ORT Braude College of Engineering, Karmiel 21982, Israel and Department of Mathematics and Statistics, The University of Maryland, Baltimore County, UMBC, Baltimor ...;Department of Mathematics and Statistics, The University of Maryland, Baltimore County, UMBC, Baltimore, MD 20250, USA;Department of Mathematics and Statistics, The University of Maryland, Baltimore County, UMBC, Baltimore, MD 20250, USA
Venue:
Pattern Recognition
Year:
2008

Abstract

In this paper we present a method for assessing cluster stability. Combined with a clustering algorithm, the method yields an estimate of the number of clusters in the data partition. We adopt the standpoint in which clusters are imagined as islands of "high" density in a sea of "low" density; explicitly, a cluster is associated with its high-density core. Our approach evaluates the goodness of a cluster by the similarity between the entire cluster and its core. We propose to measure this resemblance by two-sample tests or by probability distances between appropriate probability distributions. The distances are calculated on clustered samples drawn from the source population according to two different distributions. The first is the underlying distribution of the data set. The second is constructed to represent the clusters' cores: a variant of k-nearest neighbor density estimation is applied so that items belonging to cores have a much higher chance of being selected. Because the sample distribution is unknown, a distribution-free two-sample test is required to examine this correspondence. To construct such a test, we use distance functions built on negative definite kernels. In practice, outliers in the samples and limitations of the clustering algorithm contribute heavily to the noise level. Consequently, the distance values have to be determined for many pairs of samples, which yields an empirical distribution of distances. This distribution depends on the number of clusters examined. To prevent this dependence from biasing the results, we normalize the distances. It is conjectured that the true number of clusters yields the most concentrated normalized distribution. To measure the concentration we use the sample mean and the sample 25th percentile. The paper demonstrates the good performance of the proposed method on synthetic and real-world data.
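The two sampling laws and the kernel-based distance described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names (`knn_core_weights`, `draw_sample_pair`, `energy_distance`), the inverse k-th-neighbor-radius density estimate, and the choice of the Euclidean norm as the negative definite kernel are all assumptions made for illustration. The clustering step and the normalization of distances are omitted.

```python
import numpy as np


def knn_core_weights(X, k=5):
    """Selection probabilities proportional to a simple k-NN density estimate.

    Assumption: density at a point is taken as 1 / r_k**dim, where r_k is the
    distance to its k-th nearest neighbor. Points in dense cores get more mass.
    """
    n, dim = X.shape
    if not 1 <= k < n:
        raise ValueError("k must satisfy 1 <= k < number of points")
    # Full pairwise Euclidean distance matrix (fine for small samples).
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    # Column 0 of each sorted row is the point itself (distance 0).
    r_k = np.sort(d, axis=1)[:, k]
    density = 1.0 / np.maximum(r_k, 1e-12) ** dim
    return density / density.sum()


def draw_sample_pair(X, size, k=5, rng=None):
    """Draw one sample from the data's own law and one biased toward cores."""
    rng = np.random.default_rng() if rng is None else rng
    n = X.shape[0]
    plain_idx = rng.choice(n, size=size, replace=True)
    core_idx = rng.choice(n, size=size, replace=True, p=knn_core_weights(X, k))
    return X[plain_idx], X[core_idx]


def energy_distance(A, B):
    """Two-sample distance built on the negative definite kernel ||x - y||.

    Computes 2*E||a - b|| - E||a - a'|| - E||b - b'|| over the empirical
    samples; it is zero when the two samples coincide.
    """
    def mean_dist(P, Q):
        return np.linalg.norm(P[:, None, :] - Q[None, :, :], axis=-1).mean()

    return 2.0 * mean_dist(A, B) - mean_dist(A, A) - mean_dist(B, B)
```

In a stability study along the lines of the abstract, one would repeat `draw_sample_pair` many times for each candidate number of clusters, cluster the samples, compute the distance for each pair, and then compare how concentrated the resulting (normalized) empirical distributions are, for example through their mean and 25th percentile.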