A feature selection algorithm based on graph theory and random forests for protein secondary structure prediction

Authors:
Gulsah Altun;Hae-Jin Hu;Stefan Gremalschi;Robert W. Harrison;Yi Pan
Affiliations:
Department of Computer Science, Georgia State University;Department of Computer Science, Georgia State University;Department of Computer Science, Georgia State University;Department of Computer Science, Georgia State University and Department of Biology, Georgia State University, Atlanta, GA;Department of Computer Science, Georgia State University
Venue:
ISBRA'07 Proceedings of the 3rd international conference on Bioinformatics research and applications
Year:
2007

Abstract

The protein secondary structure prediction problem is one of the most widely studied problems in bioinformatics. Predicting the secondary structure of a protein is an important step toward determining its tertiary structure and thus its function. This paper explores the protein secondary structure problem using a novel feature selection algorithm combined with a machine learning approach based on random forests. For feature reduction, we propose a graph-theoretical algorithm that finds cliques in the non-position-specific evolutionary profiles of proteins obtained from BLOSUM62. The features selected by this algorithm are then used to condense the position-specific evolutionary information obtained from PSI-BLAST. Our results show that this approach saves a significant amount of space and time while still achieving high accuracy, even when the number of features is reduced by 25%.
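The abstract does not give the details of the clique-based reduction, so the following is only a minimal sketch of the general idea, under stated assumptions. It links two amino acids when a substitution score meets a threshold, enumerates maximal cliques with the Bron-Kerbosch algorithm, and averages the profile columns within each clique. The score table `TOY_SCORES`, the threshold, the greedy choice of disjoint cliques, the averaging rule, and all function names are hypothetical choices for illustration; they are not taken from the paper, and the toy scores are not actual BLOSUM62 entries. Training the random forest on the condensed features is left out.

```python
from itertools import combinations

# Toy symmetric similarity scores standing in for substitution-matrix
# entries. Illustrative values only, NOT real BLOSUM62 data.
TOY_SCORES = {
    ("I", "L"): 2, ("I", "V"): 3, ("L", "V"): 1,
    ("K", "R"): 2, ("D", "E"): 2,
}

def build_graph(alphabet, scores, threshold):
    """Connect two residues when their score is at least `threshold`."""
    adj = {a: set() for a in alphabet}
    for a, b in combinations(alphabet, 2):
        s = scores.get((a, b), scores.get((b, a), float("-inf")))
        if s >= threshold:
            adj[a].add(b)
            adj[b].add(a)
    return adj

def maximal_cliques(adj):
    """Enumerate maximal cliques (Bron-Kerbosch, no pivoting)."""
    cliques = []

    def expand(r, p, x):
        if not p and not x:
            cliques.append(frozenset(r))
            return
        for v in sorted(p):
            expand(r | {v}, p & adj[v], x & adj[v])
            p = p - {v}
            x = x | {v}

    expand(set(), set(adj), set())
    return cliques

def disjoint_groups(alphabet, cliques):
    """Assumed heuristic: greedily keep the largest non-overlapping cliques;
    residues left over become singleton groups."""
    groups, used = [], set()
    for c in sorted(cliques, key=lambda c: (-len(c), sorted(c))):
        if len(c) > 1 and not (c & used):
            groups.append(sorted(c))
            used |= c
    groups += [[a] for a in alphabet if a not in used]
    return groups

def condense_row(row, groups):
    """Assumed merge rule: average the profile scores within each group."""
    return [sum(row[a] for a in g) / len(g) for g in groups]

alphabet = ["I", "L", "V", "K", "R", "D", "E", "G"]
adj = build_graph(alphabet, TOY_SCORES, threshold=1)
groups = disjoint_groups(alphabet, maximal_cliques(adj))

# One hypothetical profile row (one sequence position, one score per residue).
row = {"I": 3, "L": 1, "V": 2, "K": -2, "R": -4, "D": 0, "E": 2, "G": -1}
condensed = condense_row(row, groups)
print(groups)
print(condensed)
```

In this toy setting, eight per-residue features collapse into four group features. A real profile would have one such row per sequence position, and the same grouping would be applied to every row before classification.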