Stability-based model selection for high throughput genomic data: an algorithmic paradigm
ICARIS'12 Proceedings of the 11th international conference on Artificial Immune Systems
The advent of high-throughput technologies, in particular microarrays, for biological research has revived interest in clustering, resulting in a plethora of new clustering algorithms. However, model selection, i.e., the identification of the correct number of clusters in a dataset, has received relatively little attention. Indeed, although central to statistics, its difficulty is also well known. Fortunately, a few novel techniques for model selection, representing a sharp departure from previous ones in statistics, have been proposed and have gained prominence for microarray data analysis. Among those, the stability-based methods are the most robust and best performing in terms of prediction, but the slowest in terms of time. It is very unfortunate that as fascinating and classic an area of statistics as model selection, with important practical applications, has received very little attention in terms of algorithmic design and engineering. In this paper, in order to partially fill this gap, we make the following contributions: (A) the first general algorithmic paradigm for stability-based model selection methods; (B) reductions showing that all known methods in this class are instances of the proposed paradigm; (C) a novel algorithmic paradigm for the class of stability-based methods for cluster validity, i.e., methods assessing how statistically significant a given clustering solution is; (D) a general algorithmic paradigm that describes the heuristic and very effective speed-ups known in the literature for stability-based model selection methods. Since the performance evaluation of model selection algorithms is mainly experimental, we offer, for completeness and without attempting to be exhaustive, a representative synopsis of known experimental benchmarking results that highlight both the predictive ability of stability-based methods for model selection and the computational resources they require for the task.
As a whole, the contributions of this paper generalize in several respects reference methodologies in statistics and show that algorithmic approaches can yield deep methodological insights into this area, in addition to practical computational procedures.
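To make the class of methods discussed above concrete, here is a minimal, self-contained sketch of the generic stability-based model selection loop (the function names and the specific agreement score are illustrative assumptions, not the paper's exact paradigm): for each candidate number of clusters k, repeatedly draw a subsample, cluster it twice from different random initializations, and score how consistently point pairs are co-assigned; the k with the highest mean agreement is selected.

```python
import numpy as np

def _kmeans_pp_init(X, k, rng):
    """k-means++-style seeding: spread the initial centers apart."""
    centers = [X[rng.integers(len(X))]]
    for _ in range(k - 1):
        # Squared distance of each point to its nearest chosen center.
        d2 = ((X[:, None, :] - np.asarray(centers)[None, :, :]) ** 2).sum(-1).min(1)
        centers.append(X[rng.choice(len(X), p=d2 / d2.sum())])
    return np.array(centers)

def kmeans(X, k, iters=50, seed=0):
    """Plain Lloyd's iterations; returns a cluster label per row of X."""
    rng = np.random.default_rng(seed)
    centers = _kmeans_pp_init(X, k, rng)
    labels = np.zeros(len(X), dtype=int)
    for _ in range(iters):
        d = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        labels = d.argmin(1)
        for j in range(k):
            if (labels == j).any():
                centers[j] = X[labels == j].mean(0)
    return labels

def pairwise_agreement(l1, l2):
    """Rand-index-style score: fraction of point pairs on which the two
    clusterings agree (both co-clustered or both separated)."""
    same1 = l1[:, None] == l1[None, :]
    same2 = l2[:, None] == l2[None, :]
    off_diag = ~np.eye(len(l1), dtype=bool)
    return (same1 == same2)[off_diag].mean()

def stability_select_k(X, k_range, n_resamples=10, frac=0.8, seed=0):
    """Return (best_k, scores): the most stable k and per-k mean agreement."""
    rng = np.random.default_rng(seed)
    scores = {}
    for k in k_range:
        agree = []
        for r in range(n_resamples):
            idx = rng.choice(len(X), int(frac * len(X)), replace=False)
            agree.append(pairwise_agreement(kmeans(X[idx], k, seed=2 * r),
                                            kmeans(X[idx], k, seed=2 * r + 1)))
        scores[k] = float(np.mean(agree))
    return max(scores, key=scores.get), scores
```

On data with well-separated groups, the agreement curve typically peaks at the true number of clusters, since an over- or under-specified k forces arbitrary merges or splits that vary between runs. The quadratic pairwise comparison in `pairwise_agreement` also illustrates why plain stability-based methods are the slowest in this class, and hence why the speed-ups surveyed in contribution (D) matter in practice.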