Effects of resampling method and adaptation on clustering ensemble efficacy

  • Authors:
  • Behrouz Minaei-Bidgoli;Hamid Parvin;Hamid Alinejad-Rokny;Hosein Alizadeh;William F. Punch

  • Affiliations:
  • Department of Computer Engineering, Iran University of Science and Technology, Tehran, Iran;Department of Computer Engineering, Iran University of Science and Technology, Tehran, Iran;Department of Computer Engineering, Science and Research Branch, Islamic Azad University, Tehran, Iran and , Ghaemshahr, Iran 4761764467;Department of Computer Engineering, Iran University of Science and Technology, Tehran, Iran;Department of Computer Science and Engineering, Michigan State University, East Lansing, USA 48824

  • Venue:
  • Artificial Intelligence Review
  • Year:
  • 2014

Abstract

Clustering ensembles combine multiple partitions of data into a single clustering solution of better quality. Inspired by the success of supervised bagging and boosting algorithms, we propose non-adaptive and adaptive resampling schemes for the integration of multiple independent and dependent clusterings. We investigate the effectiveness of bagging techniques, comparing the efficacy of sampling with and without replacement, in conjunction with several consensus algorithms. In our adaptive approach, individual partitions in the ensemble are sequentially generated by clustering specially selected subsamples of the given dataset. The sampling probability for each data point depends dynamically on the consistency of its previous assignments in the ensemble. New subsamples are then drawn to focus increasingly on the problematic regions of the input feature space. A measure of data point clustering consistency is therefore defined to guide this adaptation. Experimental results show improved stability and accuracy for clustering structures obtained via bootstrapping, subsampling, and adaptive techniques. A meaningful consensus partition for an entire set of data points emerges from multiple clusterings of bootstraps and subsamples. Subsamples of small size can reduce computational cost and measurement complexity for many unsupervised data mining tasks with distributed sources of data. This empirical study also compares the performance of adaptive and non-adaptive clustering ensembles using different consensus functions on a number of datasets. By focusing attention on the data points with the least consistent clustering assignments, one can better approximate the inter-cluster boundaries, or at least create diversity in the boundaries, which improves clustering accuracy and convergence speed as a function of the number of partitions in the ensemble.
The comparison of adaptive and non-adaptive approaches is a new avenue for research, and this study helps to pave the way for the useful application of distributed data mining methods.
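
The adaptive resampling loop described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not state the consistency measure, the base clusterer, or the consensus function, so everything named here is an assumption made for illustration. In particular, the sketch uses plain k-means as the base clusterer, a co-association matrix to accumulate evidence, a hypothetical consistency score (the mean of |2c - 1| over a point's co-association values c), and a simple thresholded connected-components consensus. The helper names `draw_sample`, `adaptive_ensemble`, and `consensus` are likewise hypothetical.

```python
import random

def kmeans(points, k, rng, iters=20):
    """Plain Lloyd's k-means on a list of tuples; returns one label per point."""
    centers = rng.sample(points, k)
    labels = [0] * len(points)
    for _ in range(iters):
        labels = [min(range(k),
                      key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])))
                  for p in points]
        for c in range(k):
            members = [p for p, l in zip(points, labels) if l == c]
            if members:  # keep the old center if a cluster went empty
                centers[c] = tuple(sum(x) / len(members) for x in zip(*members))
    return labels

def draw_sample(n, size, weights, replace, rng):
    """Weighted draw of indices: with replacement (bootstrap) or without (subsample)."""
    if replace:
        return rng.choices(range(n), weights=weights, k=size)
    pool, w, chosen = list(range(n)), list(weights), []
    for _ in range(size):  # requires size <= n
        j = rng.choices(range(len(pool)), weights=w, k=1)[0]
        chosen.append(pool.pop(j))
        w.pop(j)
    return chosen

def adaptive_ensemble(points, k, n_partitions, size, replace=False, seed=0):
    """Build partitions sequentially, re-weighting points by an assumed consistency score."""
    rng = random.Random(seed)
    n = len(points)
    together = [[0] * n for _ in range(n)]  # times i and j shared a cluster
    seen = [[0] * n for _ in range(n)]      # times i and j were sampled together
    weights = [1.0] * n                     # uniform start (non-adaptive baseline)
    for _ in range(n_partitions):
        idx = draw_sample(n, size, weights, replace, rng)
        labels = kmeans([points[i] for i in idx], k, rng)
        assigned = dict(zip(idx, labels))   # bootstrap duplicates collapse to one entry
        for i in assigned:
            for j in assigned:
                seen[i][j] += 1
                together[i][j] += assigned[i] == assigned[j]
        for i in range(n):
            # Assumed consistency: co-association values near 0 or 1 are "consistent".
            vals = [abs(2 * together[i][j] / seen[i][j] - 1)
                    for j in range(n) if j != i and seen[i][j]]
            consistency = sum(vals) / len(vals) if vals else 0.0
            # Less consistent points are sampled more often; the floor keeps weights positive.
            weights[i] = 1.0 - consistency + 0.05
    coassoc = [[together[i][j] / seen[i][j] if seen[i][j] else 0.0 for j in range(n)]
               for i in range(n)]
    return coassoc, weights

def consensus(coassoc, threshold=0.5):
    """Connected components of the graph linking pairs with co-association > threshold."""
    n = len(coassoc)
    labels, current = [-1] * n, 0
    for start in range(n):
        if labels[start] != -1:
            continue
        labels[start] = current
        stack = [start]
        while stack:
            i = stack.pop()
            for j in range(n):
                if labels[j] == -1 and coassoc[i][j] > threshold:
                    labels[j] = current
                    stack.append(j)
        current += 1
    return labels
```

Calling `adaptive_ensemble(points, k=2, n_partitions=20, size=8)` on a list of tuples and passing the returned co-association matrix to `consensus` yields one label per original data point, including points that were left out of individual subsamples. Setting `replace=True` switches from subsampling to bootstrapping, and keeping the weights uniform would give the non-adaptive variant for comparison.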