Aligning sequences by minimum description length

Authors:
John S. Conery
Affiliations:
Department of Computer and Information Science, University of Oregon, Eugene, OR
Venue:
EURASIP Journal on Bioinformatics and Systems Biology
Year:
2007

Abstract

This paper presents a new information-theoretic framework for aligning sequences in bioinformatics. A transmitter compresses a set of sequences by constructing a regular expression that describes the regions of similarity in the sequences. To retrieve the original set of sequences, a receiver generates all strings that match the expression. An alignment algorithm uses minimum description length to encode and explore alternative expressions; the expression with the shortest encoding provides the best overall alignment. When two substrings contain letters that are similar according to a substitution matrix, a code length function based on conditional probabilities defined by the matrix will encode the substrings with fewer bits. In one experiment, alignments produced with this new method were found to be comparable to alignments from CLUSTALW. A second experiment measured the accuracy of the new method on pairwise alignments of sequences from the BAliBASE alignment benchmark.
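
The code-length idea in the abstract can be illustrated with a minimal sketch. It is not the paper's algorithm: the two-letter alphabet, the joint probabilities in `JOINT`, and the function names are all hypothetical, and the sketch ignores the bits needed to encode the structure of the regular expression itself. It only shows that when the second string's letters are sent conditioned on their aligned partners, similar letters cost fewer bits than sending both strings independently, while dissimilar letters cost more.

```python
import math

# Hypothetical joint probabilities q(a, b) for a toy two-letter alphabet.
# Illustrative numbers only; not taken from the paper or any real matrix.
JOINT = {
    ("A", "A"): 0.40, ("A", "B"): 0.10,
    ("B", "A"): 0.10, ("B", "B"): 0.40,
}


def marginal(a):
    """Background probability p(a), obtained by summing the joint table over b."""
    return sum(p for (x, _), p in JOINT.items() if x == a)


def bits_unaligned(s, t):
    """Bits to send both strings letter by letter, each letter independently."""
    return sum(-math.log2(marginal(c)) for c in s + t)


def bits_aligned(s, t):
    """Bits to send s, then each letter of t conditioned on its partner in s."""
    if len(s) != len(t):
        raise ValueError("this sketch handles only gap-free, equal-length pairs")
    total = 0.0
    for a, b in zip(s, t):
        total += -math.log2(marginal(a))                # cost of a: -log2 p(a)
        total += -math.log2(JOINT[(a, b)] / marginal(a))  # cost of b given a
    return total


if __name__ == "__main__":
    for s, t in [("AABB", "AABB"), ("AABB", "BBAA")]:
        print(s, t, round(bits_unaligned(s, t), 3), round(bits_aligned(s, t), 3))
```

Under these assumed probabilities, aligning two identical strings is cheaper than sending them separately, and aligning two strings that mismatch at every position is more expensive, which is the signal a minimum description length search would use to accept or reject a candidate alignment.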