Using a mixture of probabilistic decision trees for direct prediction of protein function

  • Authors:
  • Umar Syed;Golan Yona

  • Affiliations:
  • Cornell University, Ithica, NY;Cornell University, Ithica, NY

  • Venue:
  • RECOMB '03 Proceedings of the seventh annual international conference on Research in computational molecular biology
  • Year:
  • 2003

Quantified Score

Hi-index 0.00

Visualization

Abstract

We study the direct relationship between basic protein properties and their function. Our goal is to develop a new tool for functional prediction that can be used to complement and support other techniques based on sequence or structure information. In order to define this new measure of similarity between proteins we collected a set of 453 features and properties that characterize proteins and are believed to be correlated and related to structural and functional aspects of proteins. Among these properties are the composition and fraction of different groups of amino acids, predicted secondary structure content, molecular weight, average hydrophobicity, isoelectric point and others, as well as a set of properties that are extracted from database records of known protein sequences, such as subcellular location, tissue specificity, and others.We introduce the mixture model of probabilistic decision trees to learn the set of potentially complex relationships between features and function. To study these correlations, trees are created and tested on the Pfam sequence-based classification of proteins and the EC classification of enzyme families. The model is very effective in learning highly diverged protein families or families that are not defined based on sequence. The resulting tree structure indicates the properties that are strongly correlated with structural and functional aspects of protein families, and can be used to suggest a concise definition of a protein family.