Brief Communication: A feature vector integration approach for a generalized support vector machine pairwise homology algorithm

  • Authors:
  • Bobbie-Jo M. Webb-Robertson;Christopher S. Oehmen;Anuj R. Shah

  • Affiliations:
  • Computational Biology & Bioinformatics, Pacific Northwest National Laboratory, United States;Computational Biology & Bioinformatics, Pacific Northwest National Laboratory, United States;Scientific Data Management, Pacific Northwest National Laboratory, United States

  • Venue:
  • Computational Biology and Chemistry
  • Year:
  • 2008

Abstract

Due to the exponential growth in the number of sequenced genomes, quickly providing accurate annotation for existing and new sequences is paramount to facilitating biological research. Current sequence comparison approaches fail to detect homologous relationships when sequence similarity is low. Support vector machine (SVM) algorithms approach this problem by transforming all proteins into a feature space of equal dimension based on protein properties, such as sequence similarity scores against a basis set of proteins or motifs. This multivariate representation of the protein space is then used to build a classifier specific to a pre-defined protein family. However, this approach is not well suited to large-scale annotation. We have developed an SVM approach that formulates remote homology detection as a single pairwise comparison problem: the two feature vectors for a pair of sequences are integrated into a single vector representation, which is used to build one classifier that separates sequence pairs into homologs and non-homologs. This pairwise SVM approach significantly improves remote homology detection on the benchmark dataset, quantified as the area under the receiver operating characteristic curve: 0.97 versus 0.73 and 0.70 for PSI-BLAST and Basic Local Alignment Search Tool (BLAST), respectively.
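
The pairwise formulation described in the abstract can be illustrated with a minimal sketch. The abstract does not state which operator is used to integrate the two feature vectors, so the combination below (element-wise absolute difference followed by element-wise product) is purely an assumption, chosen because it does not depend on the order of the pair. The protein names, feature values, family labels, and the helper names `integrate_pair`, `build_pair_dataset`, and `roc_auc` are all hypothetical. No SVM is trained here; a simple stand-in score takes its place so that the area under the ROC curve can be computed with the standard library alone.

```python
from itertools import combinations


def integrate_pair(u, v):
    """Combine two equal-length feature vectors into one pair vector.

    Assumption: the integration operator is |u - v| concatenated with u * v.
    This is an illustrative, order-independent choice, not the paper's method.
    """
    if len(u) != len(v):
        raise ValueError("feature vectors must have equal dimension")
    diff = [abs(a - b) for a, b in zip(u, v)]
    prod = [a * b for a, b in zip(u, v)]
    return diff + prod


def build_pair_dataset(features, family):
    """Return (pair_vectors, labels) over all unordered protein pairs.

    A label of 1 marks a pair from the same family (homologs), 0 otherwise.
    """
    pair_vectors, labels = [], []
    for p, q in combinations(sorted(features), 2):
        pair_vectors.append(integrate_pair(features[p], features[q]))
        labels.append(1 if family[p] == family[q] else 0)
    return pair_vectors, labels


def roc_auc(scores, labels):
    """Area under the ROC curve, computed as the probability that a random
    positive pair scores higher than a random negative pair (ties count 0.5)."""
    pos = [s for s, l in zip(scores, labels) if l == 1]
    neg = [s for s, l in zip(scores, labels) if l == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative pair")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical toy data: three features per protein, two families.
features = {
    "protA": [0.9, 0.1, 0.8],
    "protB": [0.8, 0.2, 0.9],
    "protC": [0.1, 0.9, 0.2],
    "protD": [0.2, 0.8, 0.1],
}
family = {"protA": "fam1", "protB": "fam1", "protC": "fam2", "protD": "fam2"}

X, y = build_pair_dataset(features, family)

# Stand-in for a trained classifier's decision value: pairs whose feature
# vectors differ less receive a higher score. A real pipeline would fit an
# SVM on (X, y) and score held-out pairs instead.
dim = len(features["protA"])
scores = [-sum(vec[:dim]) for vec in X]
auc = roc_auc(scores, y)
```

Because every pair is mapped into the same fixed-length space, one classifier can be trained on pairs from all families at once, which is the property the abstract contrasts with family-specific classifiers.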