Stochastic context-free grammars (SCFGs) are applied to the problems of folding, aligning, and modeling families of homologous RNA sequences. These models capture the common primary and secondary structure of the sequences with a context-free grammar, much like those used to define the syntax of programming languages. SCFGs generalize the hidden Markov models used in related work on protein and DNA sequences. The novel aspect of this work is that the SCFGs developed here are learned automatically from initially unaligned and unfolded training sequences. To do this, a new generalization of the forward-backward algorithm, commonly used to train hidden Markov models, is introduced. This algorithm is based on tree grammars and is more efficient than the inside-outside algorithm, which was previously proposed to train SCFGs. The method is tested on the family of transfer RNA (tRNA) sequences. The results show that the model can reliably discriminate tRNA sequences from other RNA sequences of similar length, that it can reliably determine the secondary structure of new tRNA sequences, and that it can produce accurate multiple alignments of large collections of tRNA sequences. The model is also extended to handle introns present in tRNA genes.
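To make the SCFG idea concrete, the following is a minimal sketch (not the paper's tree-grammar trainer) of the inside algorithm, the CYK-style dynamic program that computes the probability a stochastic context-free grammar assigns to a sequence. The toy grammar and its rule probabilities are illustrative assumptions: it generates nested pairs of `a`/`u`, loosely mimicking base-paired RNA stems, with rules in Chomsky normal form.

```python
from collections import defaultdict

def inside(sequence, binary_rules, unary_rules, start="S"):
    """Compute P(sequence | grammar) with the inside algorithm.

    binary_rules: dict mapping nonterminal A -> list of ((B, C), prob)
    unary_rules:  dict mapping nonterminal A -> list of (terminal, prob)
    """
    n = len(sequence)
    # alpha[i][j][A] = probability that A derives sequence[i..j] inclusive
    alpha = [[defaultdict(float) for _ in range(n)] for _ in range(n)]

    # Base case: spans of length 1 via emission rules A -> terminal.
    for i, symbol in enumerate(sequence):
        for A, emissions in unary_rules.items():
            for terminal, p in emissions:
                if terminal == symbol:
                    alpha[i][i][A] += p

    # Longer spans: combine two adjacent sub-spans with rules A -> B C.
    for span in range(2, n + 1):
        for i in range(n - span + 1):
            j = i + span - 1
            for k in range(i, j):            # split point
                for A, productions in binary_rules.items():
                    for (B, C), p in productions:
                        alpha[i][j][A] += p * alpha[i][k][B] * alpha[k + 1][j][C]

    return alpha[0][n - 1][start]

# Illustrative toy grammar (assumed, not from the paper):
#   S -> A T (0.5) | A B (0.5)   T -> S B (1.0)
#   A -> 'a' (1.0)               B -> 'u' (1.0)
# It derives nested strings a^n u^n, e.g. "au" and "aauu".
binary = {
    "S": [(("A", "T"), 0.5), (("A", "B"), 0.5)],
    "T": [(("S", "B"), 1.0)],
}
unary = {
    "A": [("a", 1.0)],
    "B": [("u", 1.0)],
}

print(inside("au", binary, unary))    # 0.5
print(inside("aauu", binary, unary))  # 0.25
print(inside("aa", binary, unary))    # 0.0 (not derivable)
```

The inside table plays the role that the forward variables do for HMMs; pairing it with an analogous outside pass yields the expected rule counts used for EM re-estimation, which is the computation the paper's tree-grammar algorithm speeds up.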