Prediction of secondary protein structure content from primary sequence alone – a feature selection based approach

Authors:
Lukasz Kurgan;Leila Homaeian
Affiliations:
Department of Electrical and Computer Engineering, University of Alberta, Edmonton, Alberta, Canada;Department of Electrical and Computer Engineering, University of Alberta, Edmonton, Alberta, Canada
Venue:
MLDM'05 Proceedings of the 4th international conference on Machine Learning and Data Mining in Pattern Recognition
Year:
2005

Abstract

Research in protein structure and function is one of the most important subjects in modern bioinformatics and computational biology. It often uses advanced data mining and machine learning methodologies to perform prediction or pattern recognition tasks. This paper describes a new method for predicting protein secondary structure content based on feature selection and multiple linear regression. The method develops a novel representation of primary protein sequences based on a large set of 495 features. Feature selection was performed on a very large set of nearly 6,000 proteins, and tests on standard non-homologous protein sets confirm the high quality of the developed solution. Applying feature selection together with the novel representation reduced the error rate by 14-15% compared with results obtained using the standard representation. The prediction tests also show that a small set of 5-25 features is sufficient to achieve accurate prediction of both helix and strand content for non-homologous proteins.
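
The general pipeline the abstract outlines (represent each sequence as a feature vector, keep a small subset of features, then fit a multiple linear regression to the helix or strand content) can be sketched as follows. This is only an illustration under stated assumptions, not the authors' implementation: the abstract does not list the 495 features or describe the selection procedure, so the sketch uses plain amino acid composition as a stand-in representation and a simple correlation filter as a stand-in selector. All function names are hypothetical.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def composition(seq):
    """Frequencies of the 20 standard amino acids (stand-in representation,
    NOT the paper's 495-feature set)."""
    counts = Counter(seq.upper())
    n = sum(counts[a] for a in AMINO_ACIDS) or 1
    return [counts[a] / n for a in AMINO_ACIDS]

def pearson(xs, ys):
    """Pearson correlation; returns 0.0 when either input is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / (sxx * syy) ** 0.5

def select_features(X, y, k):
    """Indices of the k columns with the largest |correlation| with y.
    Assumption: a simple filter, used here only as a placeholder selector."""
    scores = [abs(pearson([row[j] for row in X], y)) for j in range(len(X[0]))]
    top = sorted(range(len(scores)), key=lambda j: -scores[j])[:k]
    return sorted(top)

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        if abs(M[c][c]) < 1e-12:
            raise ValueError("singular system; selected features are collinear")
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for j in range(c, n + 1):
                M[r][j] -= f * M[c][j]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (M[i][n] - sum(M[i][j] * x[j] for j in range(i + 1, n))) / M[i][i]
    return x

def fit_mlr(X, y, idx):
    """Ordinary least squares with an intercept on the selected columns,
    solved through the normal equations (Z^T Z) w = Z^T y."""
    Z = [[1.0] + [row[j] for j in idx] for row in X]
    p = len(Z[0])
    A = [[sum(z[a] * z[b] for z in Z) for b in range(p)] for a in range(p)]
    rhs = [sum(z[a] * t for z, t in zip(Z, y)) for a in range(p)]
    return solve(A, rhs)  # [intercept, weight per selected feature]

def predict(coef, row, idx):
    """Predicted content (e.g. helix fraction) for one feature vector."""
    return coef[0] + sum(w * row[j] for w, j in zip(coef[1:], idx))
```

In use, X would hold one feature vector per training protein (for instance `composition(seq)`), y the known helix or strand fraction for each protein, and separate models would be fitted for helix and for strand content. Because composition frequencies sum to one, selecting all 20 of them together with an intercept makes the system singular; keeping a small subset, as the abstract suggests, avoids that.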