A permutation test for determining significance of clusters with applications to spatial and gene expression data

Authors:
P. J. Park; J. Manjourides; M. Bonetti; M. Pagano
Affiliations:
Harvard Medical School, Boston, MA, USA; Department of Biostatistics, Harvard School of Public Health, Boston, MA, USA; Department of Decision Sciences, Bocconi University, Milan, Italy; Department of Biostatistics, Harvard School of Public Health, Boston, MA, USA
Venue:
Computational Statistics & Data Analysis
Year:
2009

Abstract

Hierarchical clustering is a common procedure for identifying structure in a dataset, and it is frequently used for organizing genomic data. Although more advanced clustering algorithms are available, the simplicity and visual appeal of hierarchical clustering have made it ubiquitous in gene expression data analysis. Hence, even minor improvements in this framework would have a significant impact. There is currently no simple and systematic way of assessing and displaying the significance of the clusters in a resulting dendrogram without making distributional assumptions or ignoring gene-specific variances. In this work, we introduce a permutation test based on comparing the within-cluster structure of the observed data with that of sample datasets obtained by permuting the cluster membership. We carry out this test at each node of the dendrogram using a statistic derived from the singular value decomposition of variance matrices. The p-values thus obtained provide insight into the significance of each cluster division. Given these values, one can also modify the dendrogram by combining non-significant branches. By adjusting the significance cut-off for branches, one can produce dendrograms with a desired level of detail for ease of interpretation. We demonstrate the usefulness of this approach by applying it to illustrative datasets.
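
To make the node-level test concrete, the following is a minimal sketch of a permutation test for a single split of a dendrogram. The abstract does not specify the exact statistic, so the one used here is an assumption chosen for illustration: the total within-cluster scatter, computed as the sum of squared singular values of each cluster's centered data matrix. The function names (`within_cluster_scatter`, `node_permutation_pvalue`) and the parameters `n_perm` and `seed` are hypothetical and are not taken from the paper.

```python
import numpy as np


def within_cluster_scatter(data, labels):
    """Assumed statistic: sum over clusters of the squared singular values
    of the centered data matrix (equal to the within-cluster sum of squares)."""
    total = 0.0
    for k in np.unique(labels):
        block = data[labels == k]
        centered = block - block.mean(axis=0)
        singular_values = np.linalg.svd(centered, compute_uv=False)
        total += float(np.sum(singular_values ** 2))
    return total


def node_permutation_pvalue(data, labels, n_perm=999, seed=0):
    """Permutation p-value for one node: rows of `data` are the observations
    under the node, `labels` marks which child cluster each row belongs to."""
    rng = np.random.default_rng(seed)
    observed = within_cluster_scatter(data, labels)
    at_least_as_tight = 0
    for _ in range(n_perm):
        # Shuffle cluster membership while keeping the cluster sizes fixed.
        permuted = rng.permutation(labels)
        if within_cluster_scatter(data, permuted) <= observed:
            at_least_as_tight += 1
    # Add one to numerator and denominator so the p-value is never zero.
    return (at_least_as_tight + 1) / (n_perm + 1)


if __name__ == "__main__":
    rng = np.random.default_rng(1)
    # Two well-separated groups of 10 observations in 5 dimensions.
    data = np.vstack([rng.normal(0, 1, (10, 5)), rng.normal(10, 1, (10, 5))])
    labels = np.array([0] * 10 + [1] * 10)
    print(node_permutation_pvalue(data, labels, n_perm=199))
```

A small p-value suggests that the observed split is tighter than splits of the same sizes formed by random membership. Under this sketch, one would run the function at every internal node and merge the branches whose p-value exceeds the chosen cut-off.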