Semi-supervised clustering of large data sets with kernel methods

  • Authors:
  • Stefan Faußer; Friedhelm Schwenker

  • Affiliations:
  • -;-

  • Venue:
  • Pattern Recognition Letters
  • Year:
  • 2014

Quantified Score

Hi-index 0.10

Abstract

Labelling real-world data sets is a difficult problem. Often, the human expert is unsure about the class label of a specific sample point or, in the case of very large data sets, it is impractical to label all points manually. In semi-supervised clustering, the available sample labels, which serve as external information, are used to find better-matching cluster partitions. Further, kernel-based clustering methods are able to partition the data with nonlinear boundaries in feature space. While these methods improve the clustering results, their computation time is quadratic in the number of samples. In this paper, we propose a meta-algorithm that processes small subsets of a large data set, clusters them using the sample labels, and merges the points close to the resulting prototypes with the next subset of points, until the whole data set has been processed. Its computation time is linear in the number of samples. The error function that this meta-algorithm minimizes is presented. Although we applied this meta-algorithm to Kernel Fuzzy C-Means, Relational Neural Gas and Kernel K-Means, it can be applied to a broad range of kernel-based clustering methods. The proposed method has been empirically evaluated on two real-world benchmark data sets.
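
The chunk-wise idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' algorithm: it assumes plain (unsupervised) kernel k-means with an RBF kernel as the base method, leaves out the sample labels and the error function, and uses hypothetical names and parameters (`chunked_kernel_kmeans`, `chunk_size`, `keep_per_cluster`, `gamma`). It only shows how carrying over the points nearest to each prototype keeps every kernel matrix small, so the total cost grows roughly linearly with the number of samples.

```python
import numpy as np

def rbf_kernel(A, B, gamma=0.5):
    # Squared Euclidean distances, clipped at 0 to absorb rounding noise.
    sq = (A ** 2).sum(1)[:, None] + (B ** 2).sum(1)[None, :] - 2 * A @ B.T
    return np.exp(-gamma * np.maximum(sq, 0.0))

def kernel_kmeans(K, k, init, n_iter=20):
    # Plain kernel k-means on a precomputed kernel matrix K.
    labels = init.copy()
    n = K.shape[0]
    for _ in range(n_iter):
        dist = np.full((n, k), np.inf)
        for c in range(k):
            idx = np.flatnonzero(labels == c)
            if idx.size == 0:
                continue  # empty cluster: its column stays at infinity
            # Squared feature-space distance of every point to prototype c.
            dist[:, c] = (np.diag(K) - 2 * K[:, idx].mean(1)
                          + K[np.ix_(idx, idx)].mean())
        new = dist.argmin(1)
        if np.array_equal(new, labels):
            break
        labels = new
    return labels, dist

def chunked_kernel_kmeans(X, k, chunk_size=200, keep_per_cluster=10,
                          gamma=0.5, seed=0):
    # Hypothetical meta-loop: cluster one chunk at a time and carry the
    # points closest to each prototype over into the next chunk.
    rng = np.random.default_rng(seed)
    carried = np.empty((0, X.shape[1]))
    carried_labels = np.empty(0, dtype=int)
    for start in range(0, len(X), chunk_size):
        chunk = X[start:start + chunk_size]
        data = np.vstack([carried, chunk])
        # Carried points keep their labels; new points start at random.
        init = np.concatenate([carried_labels,
                               rng.integers(0, k, len(chunk))])
        K = rbf_kernel(data, data, gamma)  # small matrix, not n x n
        labels, dist = kernel_kmeans(K, k, init)
        keep = []
        for c in range(k):
            idx = np.flatnonzero(labels == c)
            nearest = idx[np.argsort(dist[idx, c])[:keep_per_cluster]]
            keep.extend(nearest.tolist())
        keep = np.asarray(keep, dtype=int)
        carried, carried_labels = data[keep], labels[keep]
    return carried, carried_labels
```

Each kernel matrix in this sketch has at most `chunk_size + k * keep_per_cluster` rows, independent of the size of the full data set. A semi-supervised variant would additionally have to use the known labels when assigning points, which this sketch deliberately leaves out.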