We study the direct relationship between basic protein properties and protein function. Our goal is to develop a new tool for functional prediction that complements and supports other techniques based on sequence or structure information. To define this new measure of similarity between proteins, we collected a set of 453 features and properties that characterize proteins and are believed to be correlated with their structural and functional aspects. These properties include the composition and fraction of different groups of amino acids, predicted secondary structure content, molecular weight, average hydrophobicity, and isoelectric point, as well as properties extracted from database records of known protein sequences, such as subcellular location and tissue specificity.

We introduce the mixture model of probabilistic decision trees to learn the set of potentially complex relationships between features and function. To study these correlations, trees are created and tested on the Pfam sequence-based classification of proteins and the EC classification of enzyme families. The model is very effective in learning highly diverged protein families or families that are not defined based on sequence. The resulting tree structure indicates the properties that are strongly correlated with structural and functional aspects of protein families, and can be used to suggest a concise definition of a protein family.
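
As a rough illustration of the sequence-derived properties listed above (amino acid composition, fractions of amino acid groups, average hydrophobicity), the following is a minimal sketch rather than the authors' feature pipeline. It assumes the Kyte-Doolittle hydropathy scale, and the function name `sequence_features` and the three residue groups are hypothetical choices made only for this example; the abstract does not say which scale or groupings were used.

```python
from collections import Counter

# Kyte-Doolittle hydropathy values (assumed scale; the source does not name one).
KD = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}

# Illustrative residue groups (hypothetical; chosen only for this sketch).
GROUPS = {
    "charged": set("DEKR"),
    "hydrophobic": set("AILMFVW"),
    "aromatic": set("FWY"),
}


def sequence_features(seq):
    """Return a dict of simple per-sequence features for a protein sequence."""
    # Keep only the 20 standard residues; anything else (X, B, gaps) is skipped.
    residues = [aa for aa in seq.upper() if aa in KD]
    if not residues:
        raise ValueError("sequence contains no standard amino acids")
    n = len(residues)
    counts = Counter(residues)

    # Fraction of each of the 20 standard amino acids.
    features = {f"frac_{aa}": counts.get(aa, 0) / n for aa in sorted(KD)}

    # Fraction of residues falling in each group.
    for name, members in GROUPS.items():
        features[f"frac_{name}"] = sum(counts.get(aa, 0) for aa in members) / n

    # Mean hydropathy over the sequence, plus its length.
    features["avg_hydrophobicity"] = sum(KD[aa] for aa in residues) / n
    features["length"] = n
    return features
```

A feature vector built this way could be passed to any decision-tree learner. Properties such as predicted secondary structure content or database annotations (subcellular location, tissue specificity) would need external tools or records and are not covered here.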