Selective sampling for approximate clustering of very large data sets

  • Authors:
  • Liang Wang;James C. Bezdek;Christopher Leckie;Ramamohanarao Kotagiri

  • Affiliations:
  • Department of Computer Science and Software Engineering, The University of Melbourne, Victoria, 3010, Australia;Department of Computer Science, University of West Florida, Pensacola, FL 32514, USA;Department of Computer Science and Software Engineering, The University of Melbourne, Victoria, 3010, Australia;Department of Computer Science and Software Engineering, The University of Melbourne, Victoria, 3010, Australia

  • Venue:
  • International Journal of Intelligent Systems
  • Year:
  • 2008

Abstract

A key challenge in pattern recognition is how to scale the computational efficiency of clustering algorithms on large data sets. The extension of non-Euclidean relational fuzzy c-means (NERF) clustering to very large (VL = unloadable) relational data is called the extended NERF (eNERF) clustering algorithm, which comprises four phases: (i) finding distinguished features that monitor progressive sampling; (ii) progressively sampling from an N × N relational matrix R_N to obtain an n × n sample matrix R_n; (iii) clustering R_n with literal NERF; and (iv) extending the clusters in R_n to the remainder of the relational data. Previously published examples on several fairly small data sets suggest that eNERF is feasible for truly large data sets. However, it seems that phases (i) and (ii), i.e., finding R_n, are not very practical because the sample size n often turns out to be roughly 50% of N, and this over-sampling defeats the whole purpose of eNERF. In this paper, we examine the performance of the sampling scheme of eNERF with respect to different parameters. We propose a modified sampling scheme for use with eNERF that combines simple random sampling with (parts of) the sampling procedures used by eNERF and a related algorithm, sVAT (scalable visual assessment of clustering tendency). We demonstrate that our modified sampling scheme can eliminate the over-sampling of the original progressive sampling scheme, thus enabling the processing of truly VL data. Numerical experiments on a distance matrix of a set of 3,000,000 vectors drawn from a mixture of 5 bivariate normal distributions demonstrate the feasibility and effectiveness of the proposed sampling method. We also find that actually running eNERF on a data set of this size is very costly in terms of computation time. Thus, our results demonstrate that further modification of eNERF, especially the extension stage, will be needed before it is truly practical for VL data. © 2008 Wiley Periodicals, Inc.
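
As a rough illustration of the simple random sampling ingredient mentioned in the abstract, the sketch below draws n object indices without replacement and builds only the corresponding n × n sample matrix, so the full N × N relational matrix is never materialized. It is not the authors' modified scheme, and it does not implement the distinguished features, the progressive sampling test, or the sVAT-derived steps. The function name `simple_random_sample_matrix`, its parameters, and the small synthetic three-component data set are assumptions made for the example.

```python
import math
import random


def simple_random_sample_matrix(objects, n, dissimilarity=math.dist, seed=0):
    """Draw n of the N objects uniformly without replacement and return
    (indices, sample_matrix), where sample_matrix is the n x n matrix of
    pairwise dissimilarities among the sampled objects only.

    Hypothetical helper: plain simple random sampling, not the paper's
    modified sampling scheme.
    """
    rng = random.Random(seed)
    indices = sorted(rng.sample(range(len(objects)), n))
    sample_matrix = [
        [dissimilarity(objects[i], objects[j]) for j in indices]
        for i in indices
    ]
    return indices, sample_matrix


# Illustrative data: 1,000 points from three bivariate normal components
# (the centers and sizes here are assumptions, far smaller than the
# 3,000,000 vectors used in the paper).
data_rng = random.Random(1)
centers = [(0.0, 0.0), (5.0, 5.0), (0.0, 5.0)]
objects = []
for _ in range(1000):
    cx, cy = data_rng.choice(centers)
    objects.append((data_rng.gauss(cx, 1.0), data_rng.gauss(cy, 1.0)))

indices, sample_matrix = simple_random_sample_matrix(objects, n=20)
```

Because entries are computed on demand from the sampled indices, the cost of building the sample matrix grows with n² rather than N², which is why the size of n matters so much to the feasibility of the later clustering and extension phases.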