Performance of data resampling methods for robust class discovery based on clustering

Authors:
Ulrich Möller;Dörte Radke
Affiliations:
Leibniz Institute for Natural Products Research and Infection Biology -- Hans Knöll Institute, D-07745 Jena, Germany;Leibniz Institute for Natural Products Research and Infection Biology -- Hans Knöll Institute, D-07745 Jena, Germany
Venue:
Intelligent Data Analysis
Year:
2006

Abstract

Data resampling techniques are increasingly used to assign confidence to clustering results, in particular for tumor class discovery based on genomic data. One factor that determines the success of this approach is the ability of a resampling scheme to simulate sampling variability using only the information in sparse sample data. We present a method for evaluating resampling performance based on model simulations. This method was applied to the results of 40 cluster validity indices and one partition stability index, obtained from 12 clustering procedures that included different distance measures. The results were generated for benchmark data from five statistical models, gene expression profiles from three multi-class tumor sample data sets, four data sets from the widely used UCI repository, and spatiotemporal neuroimaging data. The results suggest a ranking of the three resampling techniques analyzed: perturbation (adding noise to the data) was more effective than subsampling, and both clearly outperformed bootstrapping in detecting correct clustering consensus results. Because the results were consistent, this ranking may inform the selection of a resampling method for cluster validation in future studies. Moreover, intelligent control of the resampling parameters can increase the achievable confidence in clustering results.
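
For readers unfamiliar with the three resampling schemes compared in the abstract, the following minimal Python sketch shows how each one might generate a resampled data set from a list of feature vectors. It is an illustration under stated assumptions, not the authors' implementation: the function names, the 0.8 subsampling fraction, and the choice of Gaussian noise scaled to 10% of each feature's standard deviation are all hypothetical, since the abstract does not specify these parameters.

```python
import random
import statistics


def bootstrap(data, rng):
    """Draw len(data) rows with replacement; duplicate rows are expected."""
    n = len(data)
    idx = [rng.randrange(n) for _ in range(n)]
    return [data[i] for i in idx], idx


def subsample(data, rng, fraction=0.8):
    """Draw a fraction of the rows without replacement.

    The default fraction of 0.8 is an assumption made for illustration.
    """
    n = len(data)
    m = max(1, round(fraction * n))
    idx = rng.sample(range(n), m)
    return [data[i] for i in idx], idx


def perturb(data, rng, noise_scale=0.1):
    """Keep every row and add zero-mean Gaussian noise to each feature.

    The noise standard deviation is noise_scale times the feature's own
    standard deviation; this noise model is an assumption, not taken
    from the paper.
    """
    columns = list(zip(*data))
    sigmas = [noise_scale * statistics.pstdev(col) for col in columns]
    noisy = [[x + rng.gauss(0.0, s) for x, s in zip(row, sigmas)]
             for row in data]
    return noisy, list(range(len(data)))


if __name__ == "__main__":
    rng = random.Random(0)
    # Toy data: 50 samples with 4 features each.
    data = [[rng.gauss(0.0, 1.0) for _ in range(4)] for _ in range(50)]
    for scheme in (bootstrap, subsample, perturb):
        resampled, idx = scheme(data, rng)
        print(scheme.__name__, len(resampled), len(set(idx)))
```

In a validation study, each resampled data set would be clustered and the resulting partitions compared. The sketch makes one structural difference visible: bootstrapping yields duplicate rows and subsampling drops rows, whereas perturbation keeps every original sample exactly once.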