g-MARS: Protein Classification Using Gapped Markov Chains and Support Vector Machines
PRIB '08 Proceedings of the Third IAPR International Conference on Pattern Recognition in Bioinformatics
Classification of protein sequences has important applications in areas such as disease diagnosis, treatment development, and drug design. In this paper we present g-MARS (gapped Markov chain with support vector machine), a highly accurate protein classifier. It models the structure of a protein sequence by measuring the transition probabilities between pairs of amino acids, which yields a Markov-chain-style model for each protein sequence. To capture the similarity among protein sequences that do not match exactly, we show that this model can be generalized to incorporate gaps in the Markov chain. We justify the power of the gapped feature space theoretically through its connections to analysis methods for nonlinear dynamical systems. In an experimental study, we compare g-MARS to several other state-of-the-art protein classifiers and demonstrate that it achieves high accuracy and operates efficiently on a diverse range of protein families.
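To make the gapped transition model concrete, the following is a minimal Python sketch of one plausible reading of the idea, not the feature construction used in the paper. The abstract does not define the features exactly, so several points are assumptions: a gap of g is taken to mean that g residues are skipped between the two amino acids of a pair, counts are normalized per source residue to estimate transition probabilities, and the function names `gapped_transition_probs` and `feature_vector` and the parameter `max_gap` are hypothetical.

```python
from collections import Counter
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def gapped_transition_probs(seq, gap=0, alphabet=AMINO_ACIDS):
    """Estimate P(b | a) for residue b occurring gap + 1 positions after residue a.

    Assumption: gap=0 is the ordinary first-order Markov chain, and gap=g
    skips g residues between the two members of a pair.
    """
    step = gap + 1
    pair_counts = Counter()
    source_totals = Counter()
    # Pair each residue with the residue `step` positions later.
    for a, b in zip(seq, seq[step:]):
        if a in alphabet and b in alphabet:  # ignore non-standard symbols
            pair_counts[(a, b)] += 1
            source_totals[a] += 1
    # Normalize counts per source residue; unobserved pairs are omitted.
    return {
        pair: count / source_totals[pair[0]]
        for pair, count in pair_counts.items()
    }


def feature_vector(seq, max_gap=2, alphabet=AMINO_ACIDS):
    """Concatenate the transition probabilities for gaps 0..max_gap.

    The result has (max_gap + 1) * len(alphabet)**2 entries in a fixed
    order, with 0.0 for pairs that never occur in the sequence.
    """
    pairs = list(product(alphabet, repeat=2))
    vector = []
    for gap in range(max_gap + 1):
        probs = gapped_transition_probs(seq, gap, alphabet)
        vector.extend(probs.get(pair, 0.0) for pair in pairs)
    return vector


if __name__ == "__main__":
    probs = gapped_transition_probs("AAC", gap=0)
    print(probs[("A", "A")], probs[("A", "C")])
    print(len(feature_vector("ACAC", max_gap=2)))
```

Under these assumptions, every sequence maps to a fixed-length vector regardless of its length, so the vectors could be passed to any standard SVM implementation (for example, scikit-learn's `SVC`) for training. The kernel and parameters would need to be chosen separately, since the abstract does not specify them.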