Probabilistic hierarchical clustering for biological data

Authors:
Eran Segal; Daphne Koller
Affiliations:
Stanford University, Stanford, CA; Stanford University, Stanford, CA
Venue:
Proceedings of the sixth annual international conference on Computational biology
Year:
2002

Citing 6
Cited 6

Elements of information theory

A structural EM algorithm for phylogenetic inference
RECOMB '01 Proceedings of the fifth annual international conference on Computational biology

Improving Text Classification by Shrinkage in a Hierarchy of Classes
ICML '98 Proceedings of the Fifteenth International Conference on Machine Learning

CLICK: A Clustering Algorithm with Applications to Gene Expression Analysis
Proceedings of the Eighth International Conference on Intelligent Systems for Molecular Biology

Phylogenetic Inference in Protein Superfamilies: Analysis of SH2 Domains
ISMB '98 Proceedings of the 6th International Conference on Intelligent Systems for Molecular Biology

The Bayesian structural EM algorithm
UAI'98 Proceedings of the Fourteenth conference on Uncertainty in artificial intelligence

K-ary Clustering with Optimal Leaf Ordering for Gene Expression Data
WABI '02 Proceedings of the Second International Workshop on Algorithms in Bioinformatics

Clustering of diverse genomic data using information fusion
Proceedings of the 2004 ACM symposium on Applied computing

Utilizing hierarchical feature domain values for prediction
Data & Knowledge Engineering

Clustering gene expression data via mining ensembles of classification rules evolved using MOSES
Proceedings of the 9th annual conference on Genetic and evolutionary computation

Exploiting hierarchical domain values for Bayesian learning
PAKDD'03 Proceedings of the 7th Pacific-Asia conference on Advances in knowledge discovery and data mining

Reporting and analyzing alternative clustering solutions by employing multi-objective genetic algorithm and conducting experiments on cancer data
Knowledge-Based Systems

Abstract

Biological data, such as gene expression profiles or protein sequences, is often organized in a hierarchy of classes, where the instances assigned to "nearby" classes in the tree are similar. Most approaches for constructing a hierarchy use simple local operations that are very sensitive to noise or variation in the data. In this paper, we describe probabilistic abstraction hierarchies (PAH) [11], a general probabilistic framework for clustering data into a hierarchy, and show how it can be applied to a wide variety of biological data sets. In a PAH, each class is associated with a probabilistic generative model for the data in the class. The PAH clustering algorithm simultaneously optimizes three things: the assignment of data instances to clusters, the models associated with the clusters, and the structure of the hierarchy. A key feature of the PAH approach is that it utilizes global optimization algorithms for the last two steps, substantially reducing the sensitivity to noise and the propensity to local maxima. We show how to apply this framework to gene expression data, protein sequence data, and HIV protease sequence data. We also show how our framework supports hierarchies involving more than one type of data. We demonstrate that our method extracts useful biological knowledge and is substantially more robust than hierarchical agglomerative clustering.