Automatic annotation of protein functional class from sparse and imbalanced data sets

  • Authors:
  • Jaehee Jung;Michael R. Thon

  • Affiliations:
  • Department of Computer Science;Department of Computer Science

  • Venue:
  • VDMB'06 Proceedings of the First international conference on Data Mining and Bioinformatics
  • Year:
  • 2006

Quantified Score

Hi-index 0.00

Visualization

Abstract

In recent years, high-throughput genome sequencing and sequence analysis technologies have created the need for automated annotation and analysis of large sets of genes. The Gene Ontology (GO) provides a common controlled vocabulary for describing gene function however the process for annotating proteins with GO terms is usually through a tedious manual curation process by trained professional annotators. With the wealth of genomic data that are now available, there is a need for accurate automated annotation methods. In this paper, we propose a method for automatically predicting GO terms for proteins by applying statistical pattern recognition techniques. We employ protein functional domains as features and learn independent Support Vector Machine classifiers for each GO term. This approach creates sparse data sets with highly imbalanced class distribution. We show that these problems can be overcome with standard feature and instance selection methods. We also present a meta-learning scheme that utilizes multiple SVMs trained for each GO term, resulting in improved overall performance than either SVM can achieve alone. The implementation of the tool is available at http://fcg.tamu.edu/AAPFC.