This paper describes an approach to data-driven discovery of decision trees or rules for assigning protein sequences to functional families using sequence motifs. The method can capture regularities that are expressible as the presence or absence of arbitrary combinations of motifs. A training set of peptidase sequences, labeled with the corresponding MEROPS functional families or clans, is used to automatically construct decision trees that capture regularities sufficient to assign the sequences to their respective functional families. The performance of the resulting decision tree classifiers is then evaluated on an independent test set. We compare rules constructed from motifs generated by a multiple-sequence-alignment-based motif discovery tool (MEME) with rules constructed from expert-annotated PROSITE motifs (patterns and profiles). Our results indicate that the MEME-generated motifs provide a potentially powerful high-throughput technique for constructing protein function classifiers when adequate training data are available. Examination of the generated rules against the known three-dimensional structures of members of two families (MEROPS families C14 and M12) suggests that the proposed technique may be able to identify combinations of sequence motifs that characterize functionally significant three-dimensional structural features of proteins.
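The idea of learning rules from motif presence or absence can be illustrated with a minimal sketch. It is not the paper's implementation: it uses a plain ID3-style tree built with information gain rather than any specific tool, the three motifs (`m1`, `m2`, `m3`) are hypothetical regular expressions rather than MEME or PROSITE output, and the sequences and the labels `family_A`, `family_B`, `family_C` are made up for illustration.

```python
import math
import re
from collections import Counter

# Hypothetical motifs as regular expressions (NOT real MEME or PROSITE motifs).
MOTIFS = {"m1": r"HE..H", "m2": r"QAC.G", "m3": r"GDSGG"}


def motif_features(sequence):
    """Map a sequence to {motif name: True if the motif occurs, else False}."""
    return {name: re.search(pattern, sequence) is not None
            for name, pattern in MOTIFS.items()}


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())


def build_tree(rows, labels, features):
    """ID3-style induction over boolean features.

    Returns a class label (leaf) or a tuple
    (feature, subtree_if_absent, subtree_if_present).
    """
    majority = Counter(labels).most_common(1)[0][0]
    if len(set(labels)) == 1 or not features:
        return majority

    def gain(feature):
        remainder = 0.0
        for value in (False, True):
            subset = [lab for row, lab in zip(rows, labels)
                      if row[feature] == value]
            if subset:
                remainder += len(subset) / len(labels) * entropy(subset)
        return entropy(labels) - remainder

    best = max(features, key=gain)
    if gain(best) <= 0:
        return majority  # no feature separates the classes any further

    rest = [f for f in features if f != best]
    branches = []
    for value in (False, True):
        idx = [i for i, row in enumerate(rows) if row[best] == value]
        if idx:
            branches.append(build_tree([rows[i] for i in idx],
                                       [labels[i] for i in idx], rest))
        else:
            branches.append(majority)
    return (best, branches[0], branches[1])


def classify(tree, row):
    """Follow present/absent branches until a leaf label is reached."""
    while isinstance(tree, tuple):
        feature, absent, present = tree
        tree = present if row[feature] else absent
    return tree


def tree_to_rules(tree, conditions=()):
    """Flatten a tree into (conditions, label) pairs, one per leaf."""
    if not isinstance(tree, tuple):
        return [(conditions, tree)]
    feature, absent, present = tree
    return (tree_to_rules(absent, conditions + ((feature, False),))
            + tree_to_rules(present, conditions + ((feature, True),)))


# Made-up training sequences with made-up family labels.
TRAINING = [
    ("AAHEMGHAA", "family_A"), ("KKHELGHKK", "family_A"),
    ("TTQACRGTT", "family_B"), ("LLQACWGLL", "family_B"),
    ("PPGDSGGPP", "family_C"), ("VVGDSGGVV", "family_C"),
]
rows = [motif_features(seq) for seq, _ in TRAINING]
labels = [family for _, family in TRAINING]
tree = build_tree(rows, labels, list(MOTIFS))

# Print each root-to-leaf path as an IF ... THEN rule.
for conditions, family in tree_to_rules(tree):
    clause = " AND ".join(f"{name} {'present' if value else 'absent'}"
                          for name, value in conditions)
    print(f"IF {clause} THEN {family}")
```

Each root-to-leaf path becomes one rule over motif presence and absence, which is what makes classifiers of this kind easy to inspect against known structures. A held-out sequence would be classified with `classify(tree, motif_features(sequence))`; evaluating on a test set kept separate from `TRAINING` mirrors the independent evaluation described above.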