Unsupervised Stability-Based Ensembles to Discover Reliable Structures in Complex Bio-molecular Data

Authors:
Alberto Bertoni;Giorgio Valentini
Affiliations:
DSI, Dipartimento di Scienze dell'Informazione, Università degli Studi di Milano, 20135 Milano, Italy;DSI, Dipartimento di Scienze dell'Informazione, Università degli Studi di Milano, 20135 Milano, Italy
Venue:
Computational Intelligence Methods for Bioinformatics and Biostatistics
Year:
2009

Abstract

The assessment of the reliability of clusters discovered in bio-molecular data is a central issue in several bioinformatics problems. Several methods based on the concept of stability have been proposed to estimate the reliability of each individual cluster as well as the "optimal" number of clusters. In this conceptual framework, a clustering ensemble is obtained through bootstrapping techniques, noise injection into the data, or random projections into lower-dimensional subspaces. A measure of the reliability of a given clustering is then obtained through specific stability/reliability scores based on the similarity of the clusterings composing the ensemble. Classical stability-based methods do not provide an assessment of the statistical significance of the clustering solutions and cannot directly detect multiple structures (e.g. hierarchical structures) simultaneously present in the data. Statistical approaches based on the chi-square distribution and on the Bernstein inequality show that stability-based methods can be successfully applied to the statistical assessment of the reliability of clusters and to the discovery of multiple structures underlying complex bio-molecular data. In this paper we provide an overview of stability-based methods, focusing on stability indices and statistical tests that we recently proposed in the context of the analysis of gene expression data.
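
The ensemble-and-score idea described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation, and it does not include the chi-square or Bernstein-based significance tests. The assumptions are: noise injection is the perturbation (one of the three options the abstract names), a plain k-means is the base clustering algorithm, and the stability score is the mean pairwise Jaccard coefficient between the clusterings in the ensemble. All function names and parameters (`kmeans_labels`, `jaccard`, `stability_score`, `two_blobs`, `sigma`, `n_perturbations`) are hypothetical and chosen only for illustration.

```python
import itertools
import random


def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans_labels(points, k, iters=20):
    """Plain Lloyd k-means with deterministic farthest-point initialisation."""
    centers = [points[0]]
    while len(centers) < k:
        # Next center: the point farthest from all centers chosen so far.
        centers.append(max(points, key=lambda p: min(sq_dist(p, c) for c in centers)))
    labels = []
    for _ in range(iters):
        labels = [min(range(k), key=lambda j: sq_dist(p, centers[j])) for p in points]
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old center if a cluster becomes empty
                centers[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels


def jaccard(labels_a, labels_b):
    """Jaccard coefficient between two clusterings, based on co-clustered point pairs."""
    both = either = 0
    for i, j in itertools.combinations(range(len(labels_a)), 2):
        same_a = labels_a[i] == labels_a[j]
        same_b = labels_b[i] == labels_b[j]
        both += same_a and same_b
        either += same_a or same_b
    return both / either if either else 1.0


def stability_score(points, k, n_perturbations=10, sigma=0.2, seed=0):
    """Mean pairwise similarity of clusterings obtained on noise-perturbed data.

    Assumption: Gaussian noise injection is used as the perturbation; bootstrapping
    or random projections could be substituted at the marked line.
    """
    rng = random.Random(seed)
    ensemble = []
    for _ in range(n_perturbations):
        noisy = [[x + rng.gauss(0.0, sigma) for x in p] for p in points]  # perturbation
        ensemble.append(kmeans_labels(noisy, k))
    sims = [jaccard(a, b) for a, b in itertools.combinations(ensemble, 2)]
    return sum(sims) / len(sims)


def two_blobs(n_per_blob=15, dim=5, seed=1):
    """Synthetic toy data: two well-separated groups (not real expression data)."""
    rng = random.Random(seed)
    a = [[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(n_per_blob)]
    b = [[10 + rng.uniform(-1, 1) for _ in range(dim)] for _ in range(n_per_blob)]
    return a + b


if __name__ == "__main__":
    data = two_blobs()
    for k in (2, 3, 4):
        print(k, round(stability_score(data, k), 3))
```

On this toy data, the two groups are far apart relative to the injected noise, so every perturbed run with k = 2 recovers the same partition and the score is 1.0. Scores for other values of k depend on how consistently the extra clusters are placed. A score close to 1 for a given k indicates a clustering that is stable under perturbation; deciding whether a difference between scores is statistically significant is the role of the tests the abstract refers to, which this sketch omits.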