On two-way Bayesian agglomerative clustering of gene expression data

Authors:
Anna Fowler;Nicholas A. Heard
Affiliations:
Department of Mathematics, Imperial College London, London, UK;Department of Mathematics, Imperial College London, London, UK
Venue:
Statistical Analysis and Data Mining
Year:
2012

Abstract

This article introduces an agglomerative Bayesian model-based clustering algorithm that outputs a nested sequence of two-way cluster configurations for an input data matrix. Each two-way cluster configuration in the output hierarchy is specified by a row configuration and a column configuration whose Cartesian product partitions the data matrix. Variable selection is incorporated into the algorithm through a mixture model, which identifies the row clusters that form distinct groups across the column clusters. A primitive similarity measure between two clusters is the multiplicative change in model posterior probability implied by their merger, and the hierarchy is formed by iteratively merging the cluster pair that maximizes some fixed monotonic function of this quantity. A naive implementation of the algorithm takes this function to be the identity. However, in gene expression data the number of genes studied typically far exceeds the number of experimental samples available, and when the naive algorithm is applied to such data this imbalance in dimensionality produces an algorithmic bias toward merging samples. To counteract this bias, alternative functions of the similarity measure are considered that prevent degenerate behavior of the algorithm. The resulting improvements in the output cluster configurations are demonstrated on simulated data, and the method is then applied to real gene expression data. © 2012 Wiley Periodicals, Inc. Statistical Analysis and Data Mining, 2012
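
The greedy merge loop described in the abstract can be illustrated with a minimal sketch of the naive (identity-function) variant. Everything below is an assumption made for illustration and is not taken from the article: each block (row cluster by column cluster) is modelled as i.i.d. Normal(mu, sigma2) cells with a Normal(0, tau2) prior on mu, the prior over configurations is taken to be uniform so that the posterior ratio reduces to a marginal-likelihood ratio, and the variable-selection mixture and the alternative bias-correcting functions are omitted. The names block_log_marginal, config_log_marginal, merge and two_way_agglomerate are hypothetical.

```python
import itertools
import math

import numpy as np


def block_log_marginal(block, sigma2=1.0, tau2=10.0):
    """Log marginal likelihood of one block under the assumed model:
    cells ~ N(mu, sigma2) i.i.d., mu ~ N(0, tau2), with mu integrated out."""
    n = block.size
    s = block.sum()
    q = (block ** 2).sum()
    return (-0.5 * n * math.log(2 * math.pi * sigma2)
            - 0.5 * math.log1p(n * tau2 / sigma2)
            - 0.5 * (q - tau2 * s * s / (sigma2 + n * tau2)) / sigma2)


def config_log_marginal(X, row_clusters, col_clusters):
    """Sum block scores over the Cartesian product of row and column clusters."""
    return sum(block_log_marginal(X[np.ix_(r, c)])
               for r in row_clusters for c in col_clusters)


def merge(clusters, i, j):
    """Return a new cluster list with clusters i and j replaced by their union."""
    merged = [c for k, c in enumerate(clusters) if k not in (i, j)]
    merged.append(clusters[i] + clusters[j])
    return merged


def two_way_agglomerate(X):
    """Naive greedy hierarchy: at each step apply the single row-pair or
    column-pair merger with the largest log change in the configuration score."""
    rows = [[i] for i in range(X.shape[0])]
    cols = [[j] for j in range(X.shape[1])]
    current = config_log_marginal(X, rows, cols)
    history = [(rows, cols, current)]
    while len(rows) > 1 or len(cols) > 1:
        best = None
        for axis, clusters in (("row", rows), ("col", cols)):
            for i, j in itertools.combinations(range(len(clusters)), 2):
                cand = merge(clusters, i, j)
                new_rows, new_cols = (cand, cols) if axis == "row" else (rows, cand)
                score = config_log_marginal(X, new_rows, new_cols)
                gain = score - current  # log of the multiplicative change
                if best is None or gain > best[0]:
                    best = (gain, new_rows, new_cols, score)
        _, rows, cols, current = best
        history.append((rows, cols, current))
    return history
```

Calling two_way_agglomerate on an n-by-p array returns n + p - 1 nested configurations, from all singletons down to a single block. The sketch rescores the whole configuration for every candidate pair, which keeps it short but is far too slow for real gene expression matrices.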