Stability-based model selection for high throughput genomic data: an algorithmic paradigm

Authors:
Raffaele Giancarlo; Filippo Utro
Affiliations:
Dipartimento di Matematica ed Informatica, University of Palermo, Palermo, Italy; Computational Biology Center, IBM T.J. Watson Research Center, Yorktown Heights, NY
Venue:
ICARIS'12 Proceedings of the 11th international conference on Artificial Immune Systems
Year:
2012

Abstract

Clustering is one of the best-known activities in scientific investigation and an object of research in many disciplines, ranging from Statistics to Computer Science. In this beautiful area, one of the most difficult challenges is the model selection problem, i.e., the identification of the correct number of clusters in a dataset. In the last decade, a few novel techniques for model selection, representing a sharp departure from previous ones in statistics, have been proposed and have gained prominence for microarray data analysis. Among those, the stability-based methods are the most robust and the best performing in terms of prediction, but also the slowest in terms of running time. Unfortunately, model selection, a fascinating and classic area of statistics with important practical applications, has received very little attention in terms of algorithmic design and engineering. In this paper, in order to partially fill this gap, we highlight: (A) the first general algorithmic paradigm for stability-based methods for model selection; (B) a novel algorithmic paradigm for the class of stability-based methods for cluster validity, i.e., methods assessing how statistically significant a given clustering solution is; (C) a general algorithmic paradigm that describes the heuristic and very effective speed-ups known in the literature for stability-based model selection methods.
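To make the idea of stability-based model selection concrete, the following is a minimal sketch in Python and not the paradigm proposed in the paper. It assumes a generic resampling scheme: for each candidate number of clusters k, two random subsamples are clustered independently and the two solutions are compared on the points they share. The clustering routine (a plain k-means with farthest-first seeding), the Jaccard agreement on co-clustered pairs, the toy data, and all function names (`kmeans`, `stability`, `select_k`, `make_blobs`) are illustrative assumptions.

```python
import random
from itertools import combinations


def sqdist(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, k, rng, iters=20):
    """Lloyd's k-means with farthest-first seeding; returns one label per point."""
    centers = [rng.choice(points)]
    while len(centers) < k:
        # Next seed: the point farthest from its nearest existing center.
        centers.append(max(points, key=lambda p: min(sqdist(p, c) for c in centers)))
    labels = []
    for _ in range(iters):
        labels = [min(range(k), key=lambda c: sqdist(p, centers[c])) for p in points]
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # leave the center of an empty cluster unchanged
                centers[c] = tuple(sum(x) / len(members) for x in zip(*members))
    return labels


def stability(points, k, rng, rounds=20, frac=0.8):
    """Mean Jaccard agreement of co-clustered pairs over pairs of subsamples."""
    n, size = len(points), int(frac * len(points))
    total = 0.0
    for _ in range(rounds):
        labelings = []
        for _ in range(2):
            idx = rng.sample(range(n), size)
            labels = kmeans([points[i] for i in idx], k, rng)
            labelings.append(dict(zip(idx, labels)))
        a, b = labelings
        both = either = 0
        # Compare the two solutions only on points present in both subsamples.
        for i, j in combinations(sorted(a.keys() & b.keys()), 2):
            same_a, same_b = a[i] == a[j], b[i] == b[j]
            both += same_a and same_b
            either += same_a or same_b
        total += both / either if either else 1.0
    return total / rounds


def select_k(points, candidates, rng):
    """Score every candidate k and return the most stable one with all scores."""
    scores = {k: stability(points, k, rng) for k in candidates}
    return max(scores, key=scores.get), scores


def make_blobs(rng, per_blob=20):
    """Toy data (an assumption): three tight, well-separated 2-D groups."""
    centers = [(0.0, 0.0), (10.0, 0.0), (5.0, 9.0)]
    return [(cx + rng.uniform(-0.5, 0.5), cy + rng.uniform(-0.5, 0.5))
            for cx, cy in centers for _ in range(per_blob)]


rng = random.Random(0)
points = make_blobs(rng)
best_k, scores = select_k(points, range(2, 6), rng)
print(best_k, scores)
```

On this toy data, every subsample clustered with k = 3 recovers the same three groups, so that candidate reaches a score of 1.0, while other values of k typically score lower. The sketch also shows where the running time goes: each candidate k requires many clustering runs, which is the cost that speed-up heuristics for stability-based methods aim to reduce.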