A feature selection algorithm based on graph theory and random forests for protein secondary structure prediction

  • Authors:
  • Gulsah Altun;Hae-Jin Hu;Stefan Gremalschi;Robert W. Harrison;Yi Pan

  • Affiliations:
  • Department of Computer Science, Georgia State University;Department of Computer Science, Georgia State University;Department of Computer Science, Georgia State University;Department of Computer Science, Georgia State University and Department of Biology, Georgia State University, Atlanta, GA;Department of Computer Science, Georgia State University

  • Venue:
  • ISBRA'07 Proceedings of the 3rd international conference on Bioinformatics research and applications
  • Year:
  • 2007

Abstract

The protein secondary structure prediction problem is one of the most widely studied problems in bioinformatics. Predicting the secondary structure of a protein is an important step toward determining its tertiary structure and thus its function. This paper explores the protein secondary structure problem using a novel feature selection algorithm combined with a machine learning approach based on random forests. For feature reduction, we propose a graph-theoretical algorithm that finds cliques in the non-position-specific evolutionary profiles of proteins obtained from BLOSUM62. The features selected by this algorithm are then used to condense the position-specific evolutionary information obtained from PSI-BLAST. Our results show that we save a significant amount of space and time and still achieve high accuracy even when the number of features is reduced by 25%.
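
The clique-based reduction described in the abstract can be illustrated with a small sketch. The abstract does not say how the graph is built or how profile columns are merged, so everything below is an assumption made for illustration: the score table `TOY_SCORES` holds toy values standing in for substitution-matrix entries, the edge rule (score at least `threshold`), the averaging in `condense`, and all function names are hypothetical. Maximal cliques are enumerated with the basic Bron-Kerbosch recursion, which may or may not be the clique procedure the authors used.

```python
# Toy pairwise similarity scores between amino acids. These are illustrative
# values standing in for substitution-matrix entries (an assumption, not data
# taken from the paper).
TOY_SCORES = {
    ("I", "L"): 2, ("I", "V"): 3, ("L", "V"): 1,
    ("F", "Y"): 3, ("F", "W"): 1, ("W", "Y"): 2,
    ("K", "R"): 2, ("D", "E"): 2,
    ("I", "F"): 0, ("L", "K"): -2, ("D", "K"): -1,
}


def build_graph(scores, threshold):
    """Connect two residues when their score is at least `threshold` (assumed rule)."""
    graph = {}
    for (a, b), score in scores.items():
        graph.setdefault(a, set())
        graph.setdefault(b, set())
        if score >= threshold:
            graph[a].add(b)
            graph[b].add(a)
    return graph


def maximal_cliques(graph):
    """Enumerate all maximal cliques with the basic Bron-Kerbosch recursion."""
    found = []

    def expand(r, p, x):
        # r: current clique, p: candidates, x: already-processed vertices
        if not p and not x:
            found.append(frozenset(r))
            return
        for v in list(p):
            expand(r | {v}, p & graph[v], x & graph[v])
            p = p - {v}
            x = x | {v}

    expand(set(), set(graph), set())
    return found


def condense(profile_row, columns, groups):
    """Average the profile columns of each group into one feature (assumed merge rule)."""
    index = {aa: i for i, aa in enumerate(columns)}
    return [sum(profile_row[index[aa]] for aa in g) / len(g) for g in groups]


graph = build_graph(TOY_SCORES, threshold=1)
groups = sorted((sorted(c) for c in maximal_cliques(graph)), key=lambda g: g[0])
columns = sorted(graph)                            # one profile column per residue
row = [float(i) for i in range(len(columns))]      # placeholder profile row
reduced = condense(row, columns, groups)
print(groups)
print(reduced)
```

In this toy graph the cliques happen to be disjoint, so each residue column falls into exactly one group. In general, maximal cliques can overlap, and a real pipeline would need a rule for resolving that before merging columns. The reduced vectors would then be passed to a random forest classifier, which is not shown here.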