Approximate pairwise clustering for large data sets via sampling plus extension

  • Authors:
  • Liang Wang; Christopher Leckie; Ramamohanarao Kotagiri; James Bezdek

  • Affiliations:
  • National Lab of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences, Beijing 100190, China; Department of Computer Science and Software Engineering, The University of Melbourne, Parkville, Victoria 3010, Australia; Department of Computer Science and Software Engineering, The University of Melbourne, Parkville, Victoria 3010, Australia; Department of Computer Science and Software Engineering, The University of Melbourne, Parkville, Victoria 3010, Australia

  • Venue:
  • Pattern Recognition
  • Year:
  • 2011

Abstract

Pairwise clustering methods have shown great promise for many real-world applications. However, their computational demands make them impractical for large data sets. The contribution of this paper is a simple but efficient method, called eSPEC, that makes clustering feasible for problems involving large data sets. Our solution adopts a "sampling, clustering plus extension" strategy. The method first selects a small number of representative samples from the relational pairwise data using a selective sampling scheme. The chosen samples are then grouped using a pairwise clustering algorithm combined with local scaling. Finally, the label assignments are extended to the remaining instances by treating extension as a classification problem in a low-dimensional space, which is explicitly learned from the labeled samples using a cluster-preserving graph embedding technique. Extensive experimental results on several synthetic and real-world data sets demonstrate that the method both makes approximate clustering of large data sets feasible and accelerates clustering of loadable data sets.
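
The three-stage strategy described in the abstract can be illustrated with a minimal sketch. This is not eSPEC itself: it assumes uniform random sampling in place of the selective sampling scheme, plain average-linkage clustering on the sample's pairwise distances in place of pairwise clustering with local scaling, and a nearest-sample rule in the original space in place of classification in a learned graph embedding. All function names are hypothetical and chosen only for illustration.

```python
import math
import random


def pairwise_distances(points):
    """Full pairwise Euclidean distance matrix (built only for the small sample)."""
    return [[math.dist(p, q) for q in points] for p in points]


def average_linkage(dist, k):
    """Repeatedly merge the two clusters with the smallest average distance until k remain."""
    clusters = [[i] for i in range(len(dist))]
    while len(clusters) > k:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = sum(dist[i][j] for i in clusters[a] for j in clusters[b])
                d /= len(clusters[a]) * len(clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a].extend(clusters[b])
        del clusters[b]
    labels = [0] * len(dist)
    for label, members in enumerate(clusters):
        for i in members:
            labels[i] = label
    return labels


def sample_cluster_extend(points, n_samples, k, seed=0):
    """Hypothetical helper: cluster a sample, then extend labels to all points."""
    rng = random.Random(seed)
    # Step 1 (sampling): uniform random sampling, an assumption standing in
    # for the selective sampling scheme.
    sample_idx = rng.sample(range(len(points)), n_samples)
    sample = [points[i] for i in sample_idx]
    # Step 2 (clustering): pairwise clustering restricted to the sample.
    sample_labels = average_linkage(pairwise_distances(sample), k)
    # Step 3 (extension): each point takes the label of its nearest sample,
    # a simplification of classification in a learned low-dimensional space.
    labels = []
    for p in points:
        nearest = min(range(n_samples), key=lambda s: math.dist(p, sample[s]))
        labels.append(sample_labels[nearest])
    return labels


# Toy data: two well-separated 2-D blobs of 200 points each.
rng = random.Random(1)
blob_a = [(rng.gauss(0, 0.5), rng.gauss(0, 0.5)) for _ in range(200)]
blob_b = [(rng.gauss(10, 0.5), rng.gauss(10, 0.5)) for _ in range(200)]
labels = sample_cluster_extend(blob_a + blob_b, n_samples=30, k=2)
```

In this sketch the pairwise distance matrix is built only over the 30 sampled points rather than all 400, and the extension step compares each point against the samples alone, which is where the savings of a sample-then-extend design come from.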