Selective sampling for approximate clustering of very large data sets

Authors:
Liang Wang; James C. Bezdek; Christopher Leckie; Ramamohanarao Kotagiri
Affiliations:
Department of Computer Science and Software Engineering, The University of Melbourne, Victoria, 3010, Australia; Department of Computer Science, University of West Florida, Pensacola, FL 32514, USA; Department of Computer Science and Software Engineering, The University of Melbourne, Victoria, 3010, Australia; Department of Computer Science and Software Engineering, The University of Melbourne, Victoria, 3010, Australia
Venue:
International Journal of Intelligent Systems
Year:
2008

Abstract

A key challenge in pattern recognition is scaling clustering algorithms so that they remain computationally efficient on large data sets. The extended NERF (eNERF) clustering algorithm extends non-Euclidean relational fuzzy c-means (NERF) clustering to very large (VL = unloadable) relational data. It comprises four phases: (i) finding distinguished features that monitor progressive sampling; (ii) progressively sampling from an N × N relational matrix R_N to obtain an n × n sample matrix R_n; (iii) clustering R_n with literal NERF; and (iv) extending the clusters in R_n to the remainder of the relational data. Previously published examples on several fairly small data sets suggest that eNERF is feasible for truly large data sets. However, phases (i) and (ii), i.e., finding R_n, appear to be impractical because the sample size n often turns out to be roughly 50% of N, and this over-sampling defeats the whole purpose of eNERF. In this paper, we examine the performance of the sampling scheme of eNERF with respect to different parameters. We propose a modified sampling scheme for use with eNERF that combines simple random sampling with (parts of) the sampling procedures used by eNERF and a related algorithm, sVAT (scalable visual assessment of clustering tendency). We demonstrate that our modified sampling scheme can eliminate the over-sampling of the original progressive sampling scheme, thus enabling the processing of truly VL data. Numerical experiments on a distance matrix of a set of 3,000,000 vectors drawn from a mixture of 5 bivariate normal distributions demonstrate the feasibility and effectiveness of the proposed sampling method. We also find that actually running eNERF on a data set of this size is very costly in terms of computation time. Thus, our results demonstrate that further modification of eNERF, especially the extension stage, will be needed before it is truly practical for VL data. © 2008 Wiley Periodicals, Inc.
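
As a rough illustration of the sampling idea only, the sketch below draws points from a mixture of 5 bivariate normal distributions, takes a simple random sample of n indices, and builds the n × n sample distance matrix R_n directly from the sampled points, so the full N × N matrix R_N is never formed. This is not the authors' modified scheme: it shows only the simple random sampling ingredient, and the component means, the unit variance, the equal mixture weights, the scaled-down N = 3000, n = 50, and the helper names `sample_mixture` and `sampled_relation` are all assumptions made for the example.

```python
import math
import random


def sample_mixture(num_points, means, sigma, rng):
    """Draw points from an equal-weight mixture of isotropic bivariate normals."""
    points = []
    for _ in range(num_points):
        mx, my = rng.choice(means)  # pick one component uniformly at random
        points.append((rng.gauss(mx, sigma), rng.gauss(my, sigma)))
    return points


def sampled_relation(points, n, rng):
    """Return n sorted sample indices and the n x n Euclidean distance matrix R_n.

    Only distances between sampled points are computed; the full
    N x N relational matrix R_N is never materialised.
    """
    idx = sorted(rng.sample(range(len(points)), n))  # simple random sample without replacement
    r_n = [[math.dist(points[i], points[j]) for j in idx] for i in idx]
    return idx, r_n


rng = random.Random(0)
# Hypothetical component means; the abstract does not state the mixture parameters.
means = [(0.0, 0.0), (10.0, 0.0), (0.0, 10.0), (10.0, 10.0), (5.0, 5.0)]
points = sample_mixture(3000, means, 1.0, rng)  # scaled down from 3,000,000 vectors
idx, r_n = sampled_relation(points, 50, rng)
print(len(idx), len(r_n), len(r_n[0]))
```

With n fixed in advance, the cost of building R_n is O(n²) distance evaluations regardless of N, which is the property that over-sampling (n close to N/2) would destroy.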