Effects of resampling method and adaptation on clustering ensemble efficacy

Authors:
Behrouz Minaei-Bidgoli;Hamid Parvin;Hamid Alinejad-Rokny;Hosein Alizadeh;William F. Punch
Affiliations:
Department of Computer Engineering, Iran University of Science and Technology, Tehran, Iran;Department of Computer Engineering, Iran University of Science and Technology, Tehran, Iran;Department of Computer Engineering, Science and Research Branch, Islamic Azad University, Tehran, Iran and , Ghaemshahr, Iran 4761764467;Department of Computer Engineering, Iran University of Science and Technology, Tehran, Iran;Department of Computer Science and Engineering, Michigan State University, East Lansing, USA 48824
Venue:
Artificial Intelligence Review
Year:
2014

Abstract

Clustering ensembles combine multiple partitions of data into a single clustering solution of better quality. Inspired by the success of supervised bagging and boosting algorithms, we propose non-adaptive and adaptive resampling schemes for the integration of multiple independent and dependent clusterings. We investigate the effectiveness of bagging techniques, comparing the efficacy of sampling with and without replacement, in conjunction with several consensus algorithms. In our adaptive approach, individual partitions in the ensemble are sequentially generated by clustering specially selected subsamples of the given dataset. The sampling probability for each data point dynamically depends on the consistency of its previous assignments in the ensemble. New subsamples are then drawn to increasingly focus on the problematic regions of the input feature space. A measure of data point clustering consistency is therefore defined to guide this adaptation. Experimental results show improved stability and accuracy for clustering structures obtained via bootstrapping, subsampling, and adaptive techniques. A meaningful consensus partition for an entire set of data points emerges from multiple clusterings of bootstraps and subsamples. Subsamples of small size can reduce computational cost and measurement complexity for many unsupervised data mining tasks with distributed sources of data. This empirical study also compares the performance of adaptive and non-adaptive clustering ensembles using different consensus functions on a number of datasets. By focusing attention on the data points with the least consistent clustering assignments, one can better approximate the inter-cluster boundaries, or at least create diversity in the boundaries, which improves clustering accuracy and convergence speed as a function of the number of partitions in the ensemble.
The comparison of adaptive and non-adaptive approaches is a new avenue for research, and this study helps to pave the way for the useful application of distributed data mining methods.
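
The resampling schemes summarized in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. The abstract does not name the base clusterer, the consensus function, or the consistency measure, so the choices below are assumptions: plain k-means as the base clusterer, a co-association matrix cut at 0.5 as the consensus function, and a consistency score that measures how close a point's co-association rates are to 0 or 1. All function names (`build_ensemble`, `consensus`, `consistency`, `draw`) are hypothetical.

```python
import random

def nearest(p, centers):
    """Index of the center closest to point p (squared Euclidean distance)."""
    return min(range(len(centers)),
               key=lambda j: sum((a - b) ** 2 for a, b in zip(p, centers[j])))

def kmeans(sample, k, rng, iters=20):
    """Plain Lloyd's k-means on a list of coordinate tuples; returns the centers."""
    distinct = sorted(set(sample))            # avoid duplicate starting centers
    centers = rng.sample(distinct, min(k, len(distinct)))
    for _ in range(iters):
        groups = [[] for _ in centers]
        for p in sample:
            groups[nearest(p, centers)].append(p)
        # An empty cluster keeps its previous center.
        centers = [tuple(sum(c) / len(g) for c in zip(*g)) if g else centers[j]
                   for j, g in enumerate(groups)]
    return centers

def draw(n, size, weights, replace, rng):
    """Weighted draw of `size` indices, with or without replacement."""
    if replace:                               # bootstrap-style sampling
        return rng.choices(range(n), weights=weights, k=size)
    # Efraimidis-Spirakis keys: the `size` largest u**(1/w) form a weighted subsample.
    keys = [rng.random() ** (1.0 / w) for w in weights]
    return sorted(range(n), key=keys.__getitem__, reverse=True)[:size]

def consistency(i, together, drawn):
    """Assumed measure: how close point i's co-association rates are to 0 or 1."""
    rates = [together[i][j] / drawn[i][j]
             for j in range(len(drawn)) if j != i and drawn[i][j]]
    return sum(abs(2 * r - 1) for r in rates) / len(rates) if rates else 0.0

def build_ensemble(points, k, n_partitions, size, replace=True,
                   adaptive=False, seed=0):
    """Cluster resampled data repeatedly and accumulate co-association counts."""
    rng = random.Random(seed)
    n = len(points)
    together = [[0] * n for _ in range(n)]    # times i and j shared a cluster
    drawn = [[0] * n for _ in range(n)]       # times i and j were both sampled
    weights = [1.0] * n
    for _ in range(n_partitions):
        idx = draw(n, size, weights, replace, rng)
        centers = kmeans([points[i] for i in idx], k, rng)
        label = {i: nearest(points[i], centers) for i in set(idx)}
        for a in label:
            for b in label:
                drawn[a][b] += 1
                together[a][b] += label[a] == label[b]
        if adaptive:
            # Less consistent (or never sampled) points get larger weights.
            weights = [1.1 - consistency(i, together, drawn) for i in range(n)]
    return together, drawn

def consensus(together, drawn, threshold=0.5):
    """Link points whose co-association rate exceeds the threshold and
    return connected components as the consensus partition."""
    n = len(together)
    labels, current = [-1] * n, 0
    for start in range(n):
        if labels[start] != -1:
            continue
        labels[start], stack = current, [start]
        while stack:
            i = stack.pop()
            for j in range(n):
                if (labels[j] == -1 and drawn[i][j]
                        and together[i][j] / drawn[i][j] > threshold):
                    labels[j] = current
                    stack.append(j)
        current += 1
    return labels
```

Calling `build_ensemble(points, k, n_partitions, size)` with `replace=True` gives the bootstrap variant and `replace=False` the subsampling variant; `adaptive=True` reweights the draw after every partition. Passing the two returned matrices to `consensus` yields one label per data point, including points that were absent from some individual samples.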