Linear manifold clustering in high dimensional spaces by stochastic search

  • Authors:
  • Robert Haralick;Rave Harpaz

  • Affiliations:
  • Pattern Recognition Laboratory, CS Department, The Graduate Center, City University of New York, 365 Fifth Avenue, New York, NY 10016, USA;Pattern Recognition Laboratory, CS Department, The Graduate Center, City University of New York, 365 Fifth Avenue, New York, NY 10016, USA

  • Venue:
  • Pattern Recognition
  • Year:
  • 2007

Quantified Score

Hi-index 0.01

Visualization

Abstract

Classical clustering algorithms are based on the concept that a cluster center is a single point. Clusters which are not compact around a single point are not candidates for classical clustering approaches. In this paper we present a new clustering paradigm in which the cluster center is a linear manifold. Clusters are groups of points compact around a linear manifold. A linear manifold of dimension 0 is a point. So clustering around a center point is a special case of linear manifold clustering. Linear manifold clustering (LMCLUS) identifies subsets of the data which are embedded in arbitrary oriented lower dimensional linear manifolds. Minimal subsets of points are repeatedly sampled to construct trial linear manifolds of various dimensions. Histograms of the distances of the points to each trial manifold are computed. The sampling corresponding to the histogram having the best separation between a mode near zero and the rest is selected and the data points are partitioned on the basis of the best separation. The repeated sampling then continues recursively on each block of the partitioned data. A broad evaluation of some 100 experiments over real and synthetic data sets demonstrates the general superiority of this algorithm over any of the competing algorithms in terms of accuracy and computation time. Its expected computational time is linearly proportional to the data set dimension and data set size. Its accuracy ranges from near 0.90 to 0.99 depending on the experiment and is generally much higher than the accuracy of the competing clustering algorithms.