Classification of biological sequences with kernel methods
ICGI'06 Proceedings of the 8th international conference on Grammatical Inference: algorithms and applications
Hi-index | 3.85 |
Motivation: In the event of an outbreak of a disease caused by an initially unknown pathogen, the ability to characterize anonymous sequences prior to isolation and culturing of the pathogen will be helpful. We show that it is possible to classify viral sequences by genome type (dsDNA, ssDNA, ssRNA positive strand, ssRNA negative strand, retroid) using amino acid distribution. Results: In this paper we describe the results of analysis of amino acid preference in mammalian viruses. The study was carried out at the genome level as well as two shorter sequence levels: short (300 amino acids) and medium length (660 amino acids). The analysis indicates a correlation between the viral genome types dsDNA, ssDNA, ssRNA positive strand, ssRNA negative strand and retroid and amino acid preference. We investigated three different models of amino acid preference. The simplest amino acid preference model, 1-AAP, is a normalized description of the frequency of amino acids in genomes of a viral genome type. A slightly more complex model is the ordered pair amino acid preference model (2-AAP), which characterizes genomes of different viral genome types by the frequency of ordered pairs of amino acids. The most complex and accurate model is the ordered triple amino acid preference model (3-AAP), which is based on ordered triples of amino acids. The results demonstrate that mammalian viral genome types differ in their amino acid preference. Availability: The tools used to format and analyze data and supplementary material are available at http://www.cse.sc.edu/~rose/aminoPreference/index.html Contact: rose@cse.sc.edu