A new challenge for compression algorithms: genetic sequences
Information Processing and Management: an International Journal - Special issue: data compression
A compression algorithm for DNA sequences and its applications in genome comparison
RECOMB '00 Proceedings of the fourth annual international conference on Computational molecular biology
A Simple Statistical Algorithm for Biological Sequence Compression
DCC '07 Proceedings of the 2007 Data Compression Conference
Computing Substitution Matrices for Genomic Comparative Analysis
PAKDD '09 Proceedings of the 13th Pacific-Asia Conference on Advances in Knowledge Discovery and Data Mining
Hi-index | 0.00 |
Phylogenetic analyses of species based on single genes or parts of the genomes are often inconsistent because of factors such as variable rates of evolution and horizontal gene transfer. The availability of more and more sequenced genomes allows phylogeny construction from complete genomes that is less sensitive to such inconsistency. For such long sequences, construction methods like maximum parsimony and maximum likelihood are often not possible due to their intensive computational requirement. Another class of tree construction methods, namely distance-based methods, require a measure of distances between any two genomes. Some measures such as evolutionary edit distance of gene order and gene content are computational expensive or do not perform well when the gene content of the organisms are similar. This study presents an information theoretic measure of genetic distances between genomes based on the biological compression algorithm expert model . We demonstrate that our distance measure can be applied to reconstruct the consensus phylogenetic tree of a number of Plasmodium parasites from their genomes, the statistical bias of which would mislead conventional analysis methods. Our approach is also used to successfully construct a plausible evolutionary tree for the *** -Proteobacteria group whose genomes are known to contain many horizontally transferred genes.