Computing Substitution Matrices for Genomic Comparative Analysis

Authors:
Minh Duc Cao;Trevor I. Dix;Lloyd Allison
Affiliations:
Clayton School of Information Technology, Monash University, Clayton, Australia 3800;Clayton School of Information Technology, Monash University, Clayton, Australia 3800 and Faculty of Information & Communication Technologies, Swinburne University of Technology, Hawthorn, Australi ...;Clayton School of Information Technology, Monash University, Clayton, Australia 3800
Venue:
PAKDD '09 Proceedings of the 13th Pacific-Asia Conference on Advances in Knowledge Discovery and Data Mining
Year:
2009

Abstract

Substitution matrices describe the rates at which one character in a biological sequence mutates to another, and are important for many knowledge discovery tasks such as phylogenetic analysis and sequence alignment. Computing substitution matrices for very long genomic sequences of divergent or even unrelated species requires sensitive algorithms that can take into account differences in the composition of the sequences. We present a novel algorithm that addresses this by computing a nucleotide substitution matrix specifically for the two genomes being aligned. The method is founded on information theory and set within the expectation maximisation framework. The algorithm iteratively uses compression to align the sequences, estimates the matrix from the alignment, and then applies the matrix to find a better alignment, repeating until convergence. Our method reconstructs, with high accuracy, the substitution matrix for synthesised data generated from a known matrix with introduced noise. The model is then successfully applied to real data for various malaria parasite genomes, which have differing phylogenetic distances and compositions that lessen the effectiveness of standard statistical analysis techniques.
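
The matrix-estimation step described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the compression-based alignment step is omitted, and the sketch assumes the two sequences are already aligned position by position. The function names, the `pseudocount` parameter, and the row-normalised conditional-probability form of the matrix are all assumptions made for illustration. The second function reports the cost, in bits, of encoding one sequence given the other under the estimated matrix, which is one plausible reading of the information-theoretic framing.

```python
import math

ALPHABET = "ACGT"


def estimate_substitution_matrix(seq_a, seq_b, pseudocount=1.0):
    """Estimate P(y | x) from two pre-aligned sequences (hypothetical helper).

    Columns containing anything outside ALPHABET (for example a gap '-')
    are skipped. The pseudocount keeps every probability non-zero.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have equal length")

    # Start every cell at the pseudocount, then add the observed pairs.
    counts = {x: {y: pseudocount for y in ALPHABET} for x in ALPHABET}
    for x, y in zip(seq_a, seq_b):
        if x in counts and y in counts[x]:
            counts[x][y] += 1

    # Normalise each row so that it sums to one.
    matrix = {}
    for x in ALPHABET:
        total = sum(counts[x].values())
        matrix[x] = {y: counts[x][y] / total for y in ALPHABET}
    return matrix


def cost_in_bits(matrix, seq_a, seq_b):
    """Bits needed to encode seq_b given seq_a under the matrix."""
    return sum(
        -math.log2(matrix[x][y])
        for x, y in zip(seq_a, seq_b)
        if x in matrix and y in matrix[x]
    )


if __name__ == "__main__":
    m = estimate_substitution_matrix("AAAA", "AAAG")
    print(m["A"]["A"], m["A"]["G"])
    print(cost_in_bits(m, "AAAA", "AAAG"))
```

In an iterative scheme like the one the abstract outlines, a matrix estimated this way would be fed back into the alignment step, and the loop would stop once the matrix (or the encoding cost) no longer changes appreciably. That outer loop is left out here because the abstract does not specify how the alignment is produced.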