Stability-based model selection for high throughput genomic data: an algorithmic paradigm
ICARIS'12 Proceedings of the 11th international conference on Artificial Immune Systems
The advent of high-throughput technologies, in particular microarrays, for biological research has revived interest in clustering, resulting in a plethora of new clustering algorithms. However, model selection, i.e., the identification of the correct number of clusters in a dataset, has received relatively little attention. Indeed, although central to statistics, its difficulty is also well known. Fortunately, a few novel techniques for model selection, representing a sharp departure from previous ones in statistics, have been proposed and have gained prominence for microarray data analysis. Among those, the stability-based methods are the most robust and best performing in terms of prediction, but the slowest in terms of time. It is very unfortunate that as fascinating and classic an area of statistics as model selection, with important practical applications, has received very little attention in terms of algorithmic design and engineering. In this paper, in order to partially fill this gap, we make the following contributions: (A) the first general algorithmic paradigm for stability-based model selection methods; (B) reductions showing that all known methods in this class are instances of the proposed paradigm; (C) a novel algorithmic paradigm for the class of stability-based methods for cluster validity, i.e., methods assessing how statistically significant a given clustering solution is; (D) a general algorithmic paradigm that describes the heuristic and very effective speed-ups known in the literature for stability-based model selection methods. Since the performance evaluation of model selection algorithms is mainly experimental, we offer, for completeness and without attempting to be exhaustive, a representative synopsis of known experimental benchmarking results that highlight both the predictive ability of stability-based methods for model selection and the computational resources they require for the task.
As a whole, the contributions of this paper generalize in several respects reference methodologies in statistics and show that algorithmic approaches can yield deep methodological insights into this area, in addition to practical computational procedures.
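To make the class of methods discussed above concrete, here is a minimal, self-contained sketch of the generic stability-based model selection loop (the function names and the specific agreement score are illustrative assumptions, not the paper's exact paradigm): for each candidate number of clusters k, repeatedly draw a subsample, cluster it twice from different random initializations, and score how consistently point pairs are co-assigned; the k with the highest mean agreement is selected.

```python
import numpy as np

def _kmeans_pp_init(X, k, rng):
    """k-means++-style seeding: spread the initial centers apart."""
    centers = [X[rng.integers(len(X))]]
    for _ in range(k - 1):
        # Squared distance of each point to its nearest chosen center.
        d2 = ((X[:, None, :] - np.asarray(centers)[None, :, :]) ** 2).sum(-1).min(1)
        centers.append(X[rng.choice(len(X), p=d2 / d2.sum())])
    return np.array(centers)

def kmeans(X, k, iters=50, seed=0):
    """Plain Lloyd's iterations; returns a cluster label per row of X."""
    rng = np.random.default_rng(seed)
    centers = _kmeans_pp_init(X, k, rng)
    labels = np.zeros(len(X), dtype=int)
    for _ in range(iters):
        d = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        labels = d.argmin(1)
        for j in range(k):
            if (labels == j).any():
                centers[j] = X[labels == j].mean(0)
    return labels

def pairwise_agreement(l1, l2):
    """Rand-index-style score: fraction of point pairs on which the two
    clusterings agree (both co-clustered or both separated)."""
    same1 = l1[:, None] == l1[None, :]
    same2 = l2[:, None] == l2[None, :]
    off_diag = ~np.eye(len(l1), dtype=bool)
    return (same1 == same2)[off_diag].mean()

def stability_select_k(X, k_range, n_resamples=10, frac=0.8, seed=0):
    """Return (best_k, scores): the most stable k and per-k mean agreement."""
    rng = np.random.default_rng(seed)
    scores = {}
    for k in k_range:
        agree = []
        for r in range(n_resamples):
            idx = rng.choice(len(X), int(frac * len(X)), replace=False)
            agree.append(pairwise_agreement(kmeans(X[idx], k, seed=2 * r),
                                            kmeans(X[idx], k, seed=2 * r + 1)))
        scores[k] = float(np.mean(agree))
    return max(scores, key=scores.get), scores
```

On data with well-separated groups, the agreement curve typically peaks at the true number of clusters, since an over- or under-specified k forces arbitrary merges or splits that vary between runs. The quadratic pairwise comparison in `pairwise_agreement` also illustrates why plain stability-based methods are the slowest in this class, and hence why the speed-ups surveyed in contribution (D) matter in practice.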