Randomized maps for assessing the reliability of patients clusters in DNA microarray data analyses

Authors:
Alberto Bertoni;Giorgio Valentini
Affiliations:
DSI, Dipartimento di Scienze dell'Informazione, Università degli Studi di Milano, Via Comelico 39, Milano, Italy;DSI, Dipartimento di Scienze dell'Informazione, Università degli Studi di Milano, Via Comelico 39, Milano, Italy
Venue:
Artificial Intelligence in Medicine
Year:
2006



Abstract

Objective: Clustering algorithms may be applied to the analysis of DNA microarray data to identify novel subgroups that may lead to new taxonomies of diseases defined at the bio-molecular level. A major problem in identifying biologically meaningful clusters is assessing their reliability, since clustering algorithms may find clusters even if no structure is present.

Methodology: Recently, methods based on random "perturbations" of the data, such as bootstrapping, noise injection techniques and random subspace methods, have been applied to the problem of cluster validity estimation. In this framework, we propose stability measures that exploit the high dimensionality of DNA microarray data and the redundancy of information stored in microarray chips. To this end we randomly project the original gene expression data into lower dimensional subspaces, approximately preserving the distances between the examples according to the Johnson-Lindenstrauss (JL) theory. The stability of the clusters discovered in the original high dimensional space is estimated by comparing them with the clusters discovered in randomly projected lower dimensional subspaces. The proposed cluster-stability measures may be applied to validate and to quantitatively assess the reliability of the clusters obtained by a large class of clustering algorithms.

Results and conclusion: We tested the effectiveness of our approach with high dimensional synthetic data, whose distribution is known a priori, showing that the stability measures based on randomized maps correctly predict the number of clusters and the reliability of each individual cluster. We then showed how to apply the proposed measures to the analysis of DNA microarray data, whose underlying distribution is unknown. We evaluated the validity of clusters discovered by hierarchical clustering algorithms in diffuse large B-cell lymphoma (DLBCL) and malignant melanoma patients, showing that the proposed reliability measures can support bio-medical researchers in the identification of stable clusters of patients and in the discovery of new subtypes of diseases characterized at the bio-molecular level.
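
The idea described in the abstract (cluster in the original space, re-cluster in randomly projected subspaces, and check which clusters survive) can be illustrated with a minimal Python sketch. This is not the authors' implementation and does not reproduce their exact stability measures. It assumes a Gaussian random projection with entries of variance 1/d_out, a naive average-linkage clusterer as a stand-in for "hierarchical clustering", and a simplified stability score (the average fraction of projections in which two members of an original cluster are still grouped together). All function names, parameters and the synthetic data are hypothetical and chosen only for illustration.

```python
import math
import random


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def random_projection(data, d_out, rng):
    """Map each row of `data` to d_out dimensions with a Gaussian random matrix.

    Entries are N(0, 1/d_out), so squared distances are preserved in expectation.
    """
    d_in = len(data[0])
    scale = 1.0 / math.sqrt(d_out)
    matrix = [[rng.gauss(0.0, 1.0) * scale for _ in range(d_in)]
              for _ in range(d_out)]
    return [[sum(w * x for w, x in zip(row, point)) for row in matrix]
            for point in data]


def average_linkage(data, k):
    """Naive agglomerative average-linkage clustering cut at k clusters.

    Assumption: a stand-in clusterer; any clustering algorithm could be used.
    """
    n = len(data)
    d = [[euclidean(data[i], data[j]) for j in range(n)] for i in range(n)]
    clusters = [[i] for i in range(n)]
    while len(clusters) > k:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                total = sum(d[i][j] for i in clusters[a] for j in clusters[b])
                avg = total / (len(clusters[a]) * len(clusters[b]))
                if best is None or avg < best[0]:
                    best = (avg, a, b)
        _, a, b = best
        merged = clusters[a] + clusters[b]
        del clusters[b]  # b > a, so index a stays valid
        clusters[a] = merged
    labels = [0] * n
    for c, members in enumerate(clusters):
        for i in members:
            labels[i] = c
    return labels


def cluster_stability(data, k, d_out, n_proj, rng):
    """Return the original labels and a simplified per-cluster stability score.

    The score of a cluster is the mean, over its pairs of members, of the
    fraction of random projections in which the pair is clustered together.
    """
    labels = average_linkage(data, k)
    n = len(data)
    together = [[0] * n for _ in range(n)]
    for _ in range(n_proj):
        proj_labels = average_linkage(random_projection(data, d_out, rng), k)
        for i in range(n):
            for j in range(i + 1, n):
                if proj_labels[i] == proj_labels[j]:
                    together[i][j] += 1
    stability = {}
    for c in set(labels):
        members = [i for i in range(n) if labels[i] == c]
        pairs = [(i, j) for i in members for j in members if i < j]
        if pairs:  # singleton clusters have no pairs and are skipped
            stability[c] = sum(together[i][j] for i, j in pairs) / (n_proj * len(pairs))
    return labels, stability


def demo(seed=0):
    rng = random.Random(seed)
    # Synthetic data: two groups of 10 "patients" with 200 "genes" each.
    group_a = [[rng.gauss(0.0, 1.0) for _ in range(200)] for _ in range(10)]
    group_b = [[rng.gauss(4.0, 1.0) for _ in range(200)] for _ in range(10)]
    return cluster_stability(group_a + group_b, k=2, d_out=40, n_proj=10, rng=rng)


if __name__ == "__main__":
    labels, stability = demo()
    print(labels)
    print(stability)
```

In this sketch a score close to 1 means the members of a cluster stay together under most projections, while a low score suggests the cluster may be an artifact of the clustering algorithm. Repeating the procedure for several values of `k` and comparing the scores is one plausible way to probe the number of clusters, in the spirit of the synthetic-data experiments mentioned in the abstract.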