Genome scale prediction of protein functional class from sequence using data mining
Proceedings of the sixth ACM SIGKDD international conference on Knowledge discovery and data mining
Gene functional classification from heterogeneous data
RECOMB '01 Proceedings of the fifth annual international conference on Computational biology
A Comparative Study on Feature Selection in Text Categorization
ICML '97 Proceedings of the Fourteenth International Conference on Machine Learning
An introduction to variable and feature selection
The Journal of Machine Learning Research
Comparing Naive Bayes, Decision Trees, and SVM with AUC and Accuracy
ICDM '03 Proceedings of the Third IEEE International Conference on Data Mining
Mining with rarity: a unifying framework
ACM SIGKDD Explorations Newsletter - Special issue on learning from imbalanced datasets
Feature selection for text categorization on imbalanced data
ACM SIGKDD Explorations Newsletter - Special issue on learning from imbalanced datasets
The class imbalance problem: A systematic study
Intelligent Data Analysis
SMOTE: synthetic minority over-sampling technique
Journal of Artificial Intelligence Research
Dynamic bayesian network modeling of cyanobacterial biological processes via gene clustering
ICONIP'11 Proceedings of the 18th international conference on Neural Information Processing - Volume Part I
Hi-index | 0.00 |
In recent years, high-throughput genome sequencing and sequence analysis technologies have created the need for automated annotation and analysis of large sets of genes. The Gene Ontology (GO) provides a common controlled vocabulary for describing gene function however the process for annotating proteins with GO terms is usually through a tedious manual curation process by trained professional annotators. With the wealth of genomic data that are now available, there is a need for accurate automated annotation methods. In this paper, we propose a method for automatically predicting GO terms for proteins by applying statistical pattern recognition techniques. We employ protein functional domains as features and learn independent Support Vector Machine classifiers for each GO term. This approach creates sparse data sets with highly imbalanced class distribution. We show that these problems can be overcome with standard feature and instance selection methods. We also present a meta-learning scheme that utilizes multiple SVMs trained for each GO term, resulting in improved overall performance than either SVM can achieve alone. The implementation of the tool is available at http://fcg.tamu.edu/AAPFC.