Classification comparison of prediction of solvent accessibility from protein sequences

Authors:
Huiling Chen;Huan-Xiang Zhou;Xiaohua Hu;Illhoi Yoo
Affiliations:
Drexel University, Philadelphia, Pennsylvania;Florida State University, Tallahassee, Florida;Drexel University, Philadelphia, Pennsylvania;Drexel University, Philadelphia, Pennsylvania
Venue:
APBC '04 Proceedings of the second conference on Asia-Pacific bioinformatics - Volume 29
Year:
2004

Abstract

The prediction of residue solvent accessibility from protein sequences has been studied by various methods. Direct comparison of these methods is impossible because they use different datasets and different structure definitions. In this paper we choose 5 classification approaches (Decision Tree (DT), Support Vector Machine (SVM), Bayesian Statistics (BS), Neural Network (NN) and Multiple Linear Regression (MLR)) for predicting solvent accessibility on the same dataset and with the same structure definition, so that the methods can be compared directly. We evaluate these methods in a cross-validation test on 2148 unique proteins, using single-sequence and multiple-sequence approaches with a cutoff of 20% for the two-state definition of solvent accessibility. According to the experimental results, SVM and NN are the best predictors on multiple-sequence prediction, each reaching 79% accuracy and a correlation coefficient of 0.59, which is 2-4% better than the other three methods. A further test on a blind test set from the Critical Assessment of Techniques for Protein Structure Prediction experiment (CASP5) is consistent with this result. On single-sequence prediction, DT, BS and MLR perform about the same, at 71-72% accuracy with a correlation coefficient of 0.43. The improvement over the baseline model that uses only the identity of the target residue is small. Local sequence seems to embed very little information on accessibility. Separate training according to protein size improves the prediction when a sufficiently large dataset is available. The consensus prediction combining the 5 approaches is not significantly better than the best single method.
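The evaluation quantities named in the abstract can be illustrated with a minimal sketch. It rests on several assumptions that the abstract does not confirm: that "correlation coefficient" means the Matthews correlation coefficient, that a residue counts as exposed when its relative accessibility is strictly above the 20% cutoff, and that the consensus is a simple majority vote. The function names and the sample values are hypothetical and are not taken from the paper.

```python
from math import sqrt

CUTOFF = 0.20  # 20% cutoff for the two-state definition (from the abstract)


def to_two_state(relative_accessibility, cutoff=CUTOFF):
    """Label residues 1 (exposed) or 0 (buried).

    Assumption: "exposed" means strictly above the cutoff.
    """
    return [1 if ra > cutoff else 0 for ra in relative_accessibility]


def accuracy(true, pred):
    """Fraction of residues whose predicted state matches the true state."""
    return sum(t == p for t, p in zip(true, pred)) / len(true)


def matthews_cc(true, pred):
    """Matthews correlation coefficient for two-state labels.

    Assumption: this is the "correlation coefficient" meant in the abstract.
    Returns 0.0 when the denominator is zero.
    """
    tp = sum(t == 1 and p == 1 for t, p in zip(true, pred))
    tn = sum(t == 0 and p == 0 for t, p in zip(true, pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(true, pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(true, pred))
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


def consensus(predictions):
    """Majority vote over several predictors (assumed consensus scheme)."""
    return [1 if 2 * sum(votes) > len(votes) else 0
            for votes in zip(*predictions)]


# Made-up relative accessibilities for eight residues (illustration only).
rel_acc = [0.05, 0.35, 0.60, 0.10, 0.25, 0.00, 0.80, 0.15]
true_states = to_two_state(rel_acc)
predicted = [0, 1, 0, 0, 1, 0, 1, 1]  # hypothetical predictor output

print(accuracy(true_states, predicted))     # 0.75
print(matthews_cc(true_states, predicted))  # 0.5
```

With an odd number of predictors, as in the 5 approaches compared here, a majority vote never ties, which is why the sketch does not define a tie-breaking rule.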