Using a mixture of probabilistic decision trees for direct prediction of protein function

Authors:
Umar Syed;Golan Yona
Affiliations:
Cornell University, Ithaca, NY;Cornell University, Ithaca, NY
Venue:
RECOMB '03 Proceedings of the seventh annual international conference on Research in computational molecular biology
Year:
2003

Abstract

We study the direct relationship between basic properties of proteins and their function. Our goal is to develop a new tool for functional prediction that can complement and support other techniques based on sequence or structure information. To define this new measure of similarity between proteins, we collected a set of 453 features and properties that characterize proteins and are believed to be correlated with their structural and functional aspects. These include the composition and fraction of different groups of amino acids, predicted secondary structure content, molecular weight, average hydrophobicity, and isoelectric point, as well as properties extracted from database records of known protein sequences, such as subcellular location and tissue specificity.

We introduce the mixture model of probabilistic decision trees to learn the set of potentially complex relationships between features and function. To study these correlations, trees are created and tested on the Pfam sequence-based classification of proteins and the EC classification of enzyme families. The model is very effective in learning highly diverged protein families or families that are not defined based on sequence. The resulting tree structure indicates the properties that are strongly correlated with structural and functional aspects of protein families, and can be used to suggest a concise definition of a protein family.
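The general idea of scoring a protein with a mixture of probabilistic decision trees can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract gives no details of how trees are grown or weighted, so the three features, the thresholds, the leaf probabilities, the mixture weights, and the use of the Kyte-Doolittle hydropathy scale are all assumptions chosen for illustration. Each tree routes a feature vector to a leaf that holds a class distribution, and the mixture returns the weighted average of those distributions.

```python
from collections import Counter

# Kyte-Doolittle hydropathy values. Using this particular scale is an
# assumption; the abstract does not say how hydrophobicity was computed.
KD = {"A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5, "Q": -3.5,
      "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5, "L": 3.8, "K": -3.9,
      "M": 1.9, "F": 2.8, "P": -1.6, "S": -0.8, "T": -0.7, "W": -0.9,
      "Y": -1.3, "V": 4.2}


def extract_features(seq):
    """Compute three illustrative global features from a protein sequence."""
    counts = Counter(seq)
    n = len(seq)
    return {
        "frac_hydrophobic": sum(counts[a] for a in "AVILMFWC") / n,
        "frac_charged": sum(counts[a] for a in "DEKR") / n,
        "avg_hydrophobicity": sum(KD[a] for a in seq) / n,
    }


def leaf(dist):
    """A leaf stores a probability distribution over class labels."""
    return {"dist": dist}


def node(feature, threshold, low, high):
    """An internal node tests one feature against a threshold."""
    return {"feature": feature, "threshold": threshold,
            "low": low, "high": high}


def tree_predict(tree, x):
    """Follow the tests down to a leaf and return its class distribution."""
    while "dist" not in tree:
        branch = "low" if x[tree["feature"]] <= tree["threshold"] else "high"
        tree = tree[branch]
    return tree["dist"]


def mixture_predict(trees, weights, x):
    """Weighted average of the leaf distributions reached in each tree."""
    total = sum(weights)
    mixed = {}
    for tree, w in zip(trees, weights):
        for label, p in tree_predict(tree, x).items():
            mixed[label] = mixed.get(label, 0.0) + (w / total) * p
    return mixed


# Hypothetical hand-built trees for a single family (member vs. other).
trees = [
    node("frac_charged", 0.25,
         leaf({"member": 0.1, "other": 0.9}),
         leaf({"member": 0.8, "other": 0.2})),
    node("avg_hydrophobicity", 0.0,
         leaf({"member": 0.6, "other": 0.4}),
         leaf({"member": 0.2, "other": 0.8})),
]
weights = [0.5, 0.5]

if __name__ == "__main__":
    x = extract_features("MKKLEDRAAKE")  # made-up sequence
    print(mixture_predict(trees, weights, x))
```

In practice the trees would be learned from labeled families rather than written by hand, and the weights could reflect each tree's performance on held-out data. The sketch only shows how the individual predictions are combined.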