DIRICHLET MIXTURES: A METHOD FOR IMPROVING DETECTION OF WEAK BUT SIGNIFICANT PROTEIN SEQUENCE HOMOLOGY

Authors:
Kimmen Sjolander;Kevin Karplus;Michael Brown;Richard Hughey;Anders Krogh;I. S. Mian;David Haussler
Venue:
DIRICHLET MIXTURES: A METHOD FOR IMPROVING DETECTION OF WEAK BUT SIGNIFICANT PROTEIN SEQUENCE HOMOLOGY
Year:
1996

Citing 0
Cited 4

Improving CBIR Systems by Integrating Semantic Features. CRV '04 Proceedings of the 1st Canadian Conference on Computer and Robot Vision.
Bayesian Segmental Models with Multiple Sequence Alignment Profiles for Protein Secondary Structure and Contact Map Prediction. IEEE/ACM Transactions on Computational Biology and Bioinformatics (TCBB).
Probabilistic distance measures of the Dirichlet and Beta distributions. Pattern Recognition.
Context-Specific Independence Mixture Modelling for Protein Families. PKDD 2007 Proceedings of the 11th European conference on Principles and Practice of Knowledge Discovery in Databases.

Abstract

This paper presents the mathematical foundations of Dirichlet mixtures, which have been used to improve database search results for homologous sequences when a variable number of sequences from a protein family or domain are known. We present a method for condensing the information in a protein database into a mixture of Dirichlet densities. These mixtures are designed to be combined with observed amino acid frequencies to form estimates of expected amino acid probabilities at each position in a profile, hidden Markov model, or other statistical model. These estimates give a statistical model greater generalization capacity, so that remotely related family members can be more reliably recognized by the model. Dirichlet mixtures have been shown to outperform substitution matrices and other methods for computing these expected amino acid distributions in database search, resulting in fewer false positives and false negatives for the families tested. This paper corrects a previously published formula for estimating these expected probabilities, and contains complete derivations of the Dirichlet mixture formulas, methods for optimizing the mixtures to match particular databases, and suggestions for efficient implementation.
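
To illustrate how a Dirichlet mixture can be combined with observed counts, the following is a minimal sketch of the standard posterior-mean estimator under a Dirichlet mixture prior. Each component is weighted by its posterior probability given the counts, and the component-wise estimates (count plus pseudocount, normalized) are averaged. The function names, the three-letter toy alphabet, and the mixture parameters are assumptions made for illustration; they are not taken from the paper, whose notation and implementation details may differ.

```python
import math


def log_beta(alpha):
    """Log of the multivariate Beta function: sum(lgamma(a_i)) - lgamma(sum(a_i))."""
    return sum(math.lgamma(a) for a in alpha) - math.lgamma(sum(alpha))


def posterior_probs(counts, weights, components):
    """Posterior mean residue probabilities under a Dirichlet mixture prior.

    counts     -- observed counts n_i for one column (hypothetical input)
    weights    -- mixture coefficients q_j, assumed to sum to 1
    components -- one parameter vector alpha_j per mixture component
    """
    # Unnormalized log posterior of each component:
    # log q_j + log B(n + alpha_j) - log B(alpha_j).
    # The multinomial coefficient is the same for every component and cancels.
    log_terms = []
    for q, alpha in zip(weights, components):
        shifted = [n + a for n, a in zip(counts, alpha)]
        log_terms.append(math.log(q) + log_beta(shifted) - log_beta(alpha))

    # Normalize in log space for numerical stability.
    peak = max(log_terms)
    unnorm = [math.exp(t - peak) for t in log_terms]
    z = sum(unnorm)
    resp = [u / z for u in unnorm]

    # Average the per-component estimates (n_i + alpha_ji) / (|n| + |alpha_j|).
    total = sum(counts)
    return [
        sum(r * (counts[i] + alpha[i]) / (total + sum(alpha))
            for r, alpha in zip(resp, components))
        for i in range(len(counts))
    ]


# Toy example: a 3-letter alphabet and a 2-component mixture (illustrative values).
toy_weights = [0.5, 0.5]
toy_components = [[1.0, 1.0, 1.0], [2.0, 1.0, 1.0]]
print(posterior_probs([5, 0, 0], toy_weights, toy_components))
```

With no observations the estimate reduces to the prior mean of the mixture, and as counts accumulate it moves toward the observed frequencies, which is the behavior the abstract describes for small and large numbers of known sequences.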