Prediction of structural classes for protein sequences and domains - Impact of prediction algorithms, sequence representation and homology, and test procedures on accuracy

Authors:
Lukasz A. Kurgan;Leila Homaeian
Affiliations:
Department of Electrical and Computer Engineering, University of Alberta, Canada;Department of Electrical and Computer Engineering, University of Alberta, Canada
Venue:
Pattern Recognition
Year:
2006

Abstract

This paper addresses computational prediction of protein structural classes. Although progress has been made in this field in recent years, the main drawback of the published prediction methods is the limited scope of their comparison procedures, which in some cases were also improperly performed. Two examples are the use of protein datasets of varying homology, which has a significant impact on prediction accuracy, and the pairwise comparison of methods using different datasets. The main aim of this paper is to revisit and reevaluate the state of the art in this field on the basis of extensive experimental work. To this end, the paper performs a first-of-its-kind comprehensive, multi-goal study that investigates eight prediction algorithms, three protein sequence representations, three datasets with different homologies, and three test procedures. It evaluates the quality of several previously unused prediction algorithms, a newly proposed sequence representation, and a testing procedure that is new to the field. Several important conclusions follow. First, the logistic regression classifier, which was not previously used, is shown to perform better than the other prediction algorithms, and the high quality of the previously used support vector machines is confirmed. The results also show that the proposed sequence representation improves the accuracy of the high-quality prediction algorithms but does not improve the results of the lower-quality classifiers. The study shows that the commonly used jackknife test is computationally expensive, and therefore the computationally less demanding 10-fold cross-validation procedure is proposed; the results show no statistically significant difference between the two procedures. The experiments show that sequence homology has a very significant impact on prediction accuracy: highly homologous datasets yield higher accuracies. Thus, results of several past studies that use homologous datasets should not be perceived as reliable. The best prediction accuracy achieved for low-homology datasets is about 57%, which confirms the results reported by Wang and Yuan [How good is the prediction of protein structural class by the component-coupled method? Proteins 2000;38:165-175]. For a highly homologous dataset, instance-based classification is shown to outperform the previously reported results, achieving 97% prediction accuracy and demonstrating that homology is a major factor that can lead to overestimated prediction accuracy.
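
The relationship between the two test procedures named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes a plain amino-acid composition vector as the sequence representation and a 1-nearest-neighbour rule as a stand-in for instance-based classification, and every function name, the toy sequences, and their labels are invented for illustration. The jackknife test is treated here as k-fold cross-validation with k equal to the number of sequences, which is why it needs as many train/test rounds as there are sequences, while 10-fold cross-validation needs only ten.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues

def composition(seq):
    """Return the 20-dimensional amino-acid composition of a sequence."""
    counts = Counter(seq)
    return [counts[a] / len(seq) for a in AMINO_ACIDS]

def k_fold_indices(n, k):
    """Yield (train, test) index lists; sample i goes to fold i % k."""
    for fold in range(k):
        test = [i for i in range(n) if i % k == fold]
        train = [i for i in range(n) if i % k != fold]
        yield train, test

def nearest_neighbour(train_x, train_y, x):
    """Predict the label of x from its closest training vector."""
    def sq_dist(a, b):
        return sum((p - q) ** 2 for p, q in zip(a, b))
    best = min(range(len(train_x)), key=lambda i: sq_dist(train_x[i], x))
    return train_y[best]

def cv_accuracy(xs, ys, k):
    """Fraction of samples predicted correctly under k-fold cross-validation."""
    correct = 0
    for train, test in k_fold_indices(len(xs), k):
        tx = [xs[i] for i in train]
        ty = [ys[i] for i in train]
        for i in test:
            correct += nearest_neighbour(tx, ty, xs[i]) == ys[i]
    return correct / len(xs)

def jackknife_accuracy(xs, ys):
    """Jackknife (leave-one-out) test: k-fold with one sample per fold."""
    return cv_accuracy(xs, ys, k=len(xs))

# Toy data: made-up sequences and labels, not real proteins.
TOY = [
    ("AAELLKKALEEL", "all-alpha"), ("VTVKVYTGSVTV", "all-beta"),
    ("LEEALKKLAEEA", "all-alpha"), ("TVYVKGTVSVTY", "all-beta"),
    ("KLAEELLKKAEA", "all-alpha"), ("GTVTVSYKVTVV", "all-beta"),
    ("EELLKALAEKLA", "all-alpha"), ("YTVSVTGKVYTV", "all-beta"),
]
XS = [composition(seq) for seq, _ in TOY]
YS = [label for _, label in TOY]

if __name__ == "__main__":
    print("4-fold accuracy:  ", cv_accuracy(XS, YS, k=4))
    print("jackknife accuracy:", jackknife_accuracy(XS, YS))
```

On a dataset of n sequences the jackknife loop above trains n times, so its cost grows with dataset size, whereas the fold count of 10-fold cross-validation stays fixed. The sketch says nothing about homology; a realistic evaluation would also need to control sequence similarity between training and test sequences, as the abstract stresses.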