A statistical model of cluster stability

  • Authors:
  • Z. Volkovich; Z. Barzily; L. Morozensky

  • Affiliations:
  • Software Engineering Department, ORT Braude College of Engineering, Karmiel 21982, Israel and Department of Mathematics and Statistics, The University of Maryland, Baltimore County, UMBC, Baltimor ...; Department of Mathematics and Statistics, The University of Maryland, Baltimore County, UMBC, Baltimore, MD 20250, USA; Department of Mathematics and Statistics, The University of Maryland, Baltimore County, UMBC, Baltimore, MD 20250, USA

  • Venue:
  • Pattern Recognition
  • Year:
  • 2008

Abstract

This paper presents a method for assessing cluster stability. Combined with a clustering algorithm, the method yields an estimate of the data partition, namely, the number of clusters. We adopt the cluster stability standpoint in which clusters are imagined as islands of "high" density in a sea of "low" density. Explicitly, a cluster is associated with its high-density core. Our approach evaluates the goodness of a cluster by the similarity between the entire cluster and its core. We propose to measure this resemblance by two-sample tests or by probability distances between appropriate probability distributions. The distances are calculated on clustered samples drawn from the source population according to two different distributions. The first is the underlying distribution of the data set. The second is constructed so that it represents the clusters' cores: a variant of k-nearest neighbor density estimation is applied, so that items belonging to cores have a much higher chance of being selected. As the sample distribution is unknown, a distribution-free two-sample test is required to examine this correspondence. To construct such a test, we use distance functions built on negative definite kernels. In practice, outliers in the samples and limitations of the clustering algorithm contribute heavily to the noise level. Consequently, the distance values have to be determined for many pairs of samples, which yields an empirical distribution of the distances. This distribution depends on the number of clusters examined. To prevent this dependence from biasing the results, we normalize the distances. It is conjectured that the true number of clusters yields the most concentrated normalized distribution. To measure the concentration we use the sample mean and the sample 25th percentile. The paper demonstrates the good performance of the proposed method on synthetic and real-world data.
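
Two ingredients of the abstract can be illustrated with a minimal sketch: sampling that favors cluster cores, and a two-sample distance built on a negative definite kernel. The sketch is not the authors' implementation. It assumes the Euclidean norm as the negative definite kernel (which gives the energy distance), uses the inverse of the k-th nearest-neighbor distance as a density proxy, and picks k = 5 arbitrarily. The function names `pairwise_dist`, `energy_distance` and `core_weights` are hypothetical. The clustering step, the normalization of the distances, and the concentration statistics are not shown.

```python
import numpy as np

def pairwise_dist(a, b):
    """Euclidean distances between every row of a and every row of b."""
    diff = a[:, None, :] - b[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

def energy_distance(x, y):
    """Two-sample distance 2*E|X-Y| - E|X-X'| - E|Y-Y'| (V-statistic form).

    Assumption: the kernel is the Euclidean norm, one example of a
    negative definite kernel; the paper may use a different one.
    """
    return (2.0 * pairwise_dist(x, y).mean()
            - pairwise_dist(x, x).mean()
            - pairwise_dist(y, y).mean())

def core_weights(data, k=5):
    """Sampling probabilities proportional to a k-NN density proxy.

    Assumption: density ~ 1 / r_k^d, where r_k is the distance to the
    k-th nearest neighbor and d is the dimension. Requires len(data) > k.
    """
    d = pairwise_dist(data, data)
    # Column 0 of each sorted row is the point itself (distance 0),
    # so column k holds the distance to the k-th nearest neighbor.
    r_k = np.sort(d, axis=1)[:, k]
    dens = 1.0 / (r_k ** data.shape[1] + 1e-12)
    return dens / dens.sum()

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    data = rng.normal(size=(200, 2))
    # First law: uniform draw from the data set.
    plain = data[rng.choice(len(data), size=60, replace=False)]
    # Second law: draw biased toward high-density (core) points.
    core = data[rng.choice(len(data), size=60, replace=False,
                           p=core_weights(data))]
    print("distance between plain and core samples:",
          energy_distance(plain, core))
```

In a full procedure along the lines of the abstract, both samples would be clustered for each candidate number of clusters, and the distance would be computed cluster by cluster over many sample pairs before normalization.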