Bayesian unsupervised learning of DNA regulatory binding regions

Authors:
Jukka Corander; Magnus Ekdahl; Timo Koski
Affiliations:
Department of Mathematics, Åbo Akademi University, Turku, Finland; Department of Mathematics, University of Linköping, Linköping, Sweden; Department of Mathematics, The Royal Institute of Technology, Stockholm, Sweden
Venue:
Advances in Artificial Intelligence
Year:
2009

Abstract

Identifying regulatory binding motifs, that is, short specific words within DNA sequences, is a common problem in computational bioinformatics. A wide variety of probabilistic approaches have been proposed in the literature, either to scan for previously known motif types or to attempt de novo identification of a fixed number (typically one) of putative motifs. Most approaches assume that reliable biodatabase information is available for building a probabilistic a priori description of the motif classes. Attempts at probabilistic unsupervised learning of the number of putative de novo motif types and their positions within a set of DNA sequences are very rare in the literature. Here we show how such a learning problem can be formulated using a Bayesian model that aims to simultaneously maximize the marginal likelihood of the sequence data arising under multiple motif types and under the background DNA model, which is a variable-length Markov chain. We demonstrate that the adopted Bayesian modelling strategy, combined with recently introduced nonstandard stochastic computation tools, yields a more tractable learning procedure than is possible with standard Monte Carlo approaches. Improvements and extensions of the proposed approach are also discussed.
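
To make the notion of a motif's marginal likelihood concrete, the following minimal sketch scores a set of aligned candidate motif instances. It rests on assumptions that the abstract does not state: motif positions are treated as independent, each position has a symmetric Dirichlet prior with hyperparameter `alpha`, and the variable-length Markov chain background is left out entirely. The function names are hypothetical and the code should not be read as the authors' model.

```python
import math
from collections import Counter

ALPHABET = "ACGT"


def log_marginal_column(counts, alpha=1.0):
    """Log marginal likelihood of one motif column.

    Assumes a multinomial over ACGT with a symmetric Dirichlet(alpha)
    prior, integrated out analytically (an illustrative assumption).
    """
    k = len(ALPHABET)
    n = sum(counts.get(base, 0) for base in ALPHABET)
    result = math.lgamma(k * alpha) - math.lgamma(k * alpha + n)
    for base in ALPHABET:
        result += math.lgamma(alpha + counts.get(base, 0)) - math.lgamma(alpha)
    return result


def log_marginal_motif(instances, alpha=1.0):
    """Sum the column scores over all positions of equally long instances."""
    width = len(instances[0])
    if any(len(s) != width for s in instances):
        raise ValueError("all motif instances must have the same width")
    total = 0.0
    for j in range(width):
        column_counts = Counter(s[j] for s in instances)
        total += log_marginal_column(column_counts, alpha)
    return total


# Hypothetical toy data: a fully conserved set and a fully mixed set.
conserved = ["ACGT", "ACGT", "ACGT", "ACGT"]
mixed = ["ACGT", "CGTA", "GTAC", "TACG"]
```

Under these assumptions the conserved set receives a higher log marginal likelihood than the mixed set, which is the kind of signal a search over motif positions could exploit. A fuller treatment would compare this score against the likelihood of the same words under the background model.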