Protein Classification into Domains of Life Using Markov Chain Models

Authors:
Francisca Zanoguera;Massimo de Francesco
Affiliations:
Serono Pharmaceutical Research Institute;Serono Pharmaceutical Research Institute
Venue:
CSB '04 Proceedings of the 2004 IEEE Computational Systems Bioinformatics Conference
Year:
2004

Citing 0
Cited 3

Combined classifier for unknown genome classification using chaos game representation features
ISB '10 Proceedings of the International Symposium on Biocomputing

Species identification based on approximate matching
COMPUTE '11 Proceedings of the Fourth Annual ACM Bangalore Conference

Clustering genome data based on approximate matching
International Journal of Data Analysis Techniques and Strategies

Abstract

It has recently been shown that oligopeptide composition allows the proteomes of different organisms to be clustered into the main domains of life. In this paper, we go a step further by showing that, given a single protein, it is possible to predict whether it is of bacterial or eukaryotic origin with 85% accuracy. We obtain this result after ensuring that no important homologies exist between the sequences in the test set and those in the training set. To do this, we model each sequence as a Markov chain. A bacterial model and a eukaryotic model are built from the training sets. Each input sequence is then classified by calculating the log-odds ratio of its probability under the two models. By analyzing the resulting models, we extract a set of the most discriminant oligopeptides, many of which are part of known functional motifs.
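
The classification step described in the abstract can be illustrated with a minimal sketch. The abstract does not state the order of the Markov chain, the smoothing scheme, or the decision threshold, so the choices below (a first-order chain by default, add-one pseudocounts, a threshold of zero, and ignoring the initial-state probability) are assumptions made for illustration only. The function names are hypothetical and not taken from the paper.

```python
import math
from collections import defaultdict

# Assumption: sequences use only the 20 standard amino-acid letters.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def train_markov_model(sequences, order=1, pseudocount=1.0):
    """Estimate P(residue | previous `order` residues) from training sequences.

    Returns a function log_prob(context, residue) giving the smoothed
    log transition probability. Smoothing with pseudocounts is an
    assumption; the abstract does not specify how zero counts are handled.
    """
    counts = defaultdict(lambda: defaultdict(float))
    for seq in sequences:
        for i in range(order, len(seq)):
            context, residue = seq[i - order:i], seq[i]
            counts[context][residue] += 1.0

    def log_prob(context, residue):
        ctx_counts = counts.get(context, {})
        total = sum(ctx_counts.values()) + pseudocount * len(AMINO_ACIDS)
        return math.log((ctx_counts.get(residue, 0.0) + pseudocount) / total)

    return log_prob


def log_odds(seq, bacterial, eukaryotic, order=1):
    """Sum of per-transition log-odds: log P(seq | bacterial) - log P(seq | eukaryotic).

    The probability of the first `order` residues is ignored in both
    models (an assumption made to keep the sketch short).
    """
    return sum(
        bacterial(seq[i - order:i], seq[i]) - eukaryotic(seq[i - order:i], seq[i])
        for i in range(order, len(seq))
    )


def classify(seq, bacterial, eukaryotic, order=1):
    """Label a sequence by the sign of its log-odds score (threshold 0 assumed)."""
    score = log_odds(seq, bacterial, eukaryotic, order)
    return "bacterial" if score > 0 else "eukaryotic"
```

In use, one would call `train_markov_model` once on the bacterial training set and once on the eukaryotic training set, then pass both returned functions to `classify` for each test protein. Transitions with a large positive or negative per-transition log-odds contribution are the natural candidates for the "most discriminant" oligopeptides mentioned in the abstract, although the paper's actual selection procedure is not described here.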