Automatic annotation of protein functional class from sparse and imbalanced data sets

Authors:
Jaehee Jung;Michael R. Thon
Affiliations:
Department of Computer Science;Department of Computer Science
Venue:
VDMB'06 Proceedings of the First international conference on Data Mining and Bioinformatics
Year:
2006

Citing 9
Cited 1

Genome scale prediction of protein functional class from sequence using data mining

Proceedings of the sixth ACM SIGKDD international conference on Knowledge discovery and data mining
Gene functional classification from heterogeneous data

RECOMB '01 Proceedings of the fifth annual international conference on Computational biology
A Comparative Study on Feature Selection in Text Categorization

ICML '97 Proceedings of the Fourteenth International Conference on Machine Learning
An introduction to variable and feature selection

The Journal of Machine Learning Research
Comparing Naive Bayes, Decision Trees, and SVM with AUC and Accuracy

ICDM '03 Proceedings of the Third IEEE International Conference on Data Mining
Mining with rarity: a unifying framework

ACM SIGKDD Explorations Newsletter - Special issue on learning from imbalanced datasets
Feature selection for text categorization on imbalanced data

ACM SIGKDD Explorations Newsletter - Special issue on learning from imbalanced datasets
The class imbalance problem: A systematic study

Intelligent Data Analysis
SMOTE: synthetic minority over-sampling technique

Journal of Artificial Intelligence Research

Dynamic bayesian network modeling of cyanobacterial biological processes via gene clustering

ICONIP'11 Proceedings of the 18th international conference on Neural Information Processing - Volume Part I

Quantified Score

Hi-index	0.00

Visualization

Abstract

In recent years, high-throughput genome sequencing and sequence analysis technologies have created the need for automated annotation and analysis of large sets of genes. The Gene Ontology (GO) provides a common controlled vocabulary for describing gene function however the process for annotating proteins with GO terms is usually through a tedious manual curation process by trained professional annotators. With the wealth of genomic data that are now available, there is a need for accurate automated annotation methods. In this paper, we propose a method for automatically predicting GO terms for proteins by applying statistical pattern recognition techniques. We employ protein functional domains as features and learn independent Support Vector Machine classifiers for each GO term. This approach creates sparse data sets with highly imbalanced class distribution. We show that these problems can be overcome with standard feature and instance selection methods. We also present a meta-learning scheme that utilizes multiple SVMs trained for each GO term, resulting in improved overall performance than either SVM can achieve alone. The implementation of the tool is available at http://fcg.tamu.edu/AAPFC.