Combining gene sequence similarity and textual information for gene function annotation in the literature

  • Authors: Luo Si; Danni Yu; Daisuke Kihara; Yi Fang

  • Affiliations: Department of Computer Science and Statistics, Purdue University, West Lafayette, USA 47906; Department of Statistics, Purdue University, West Lafayette, USA 47906; Department of Biology and Computer Science, Purdue University, West Lafayette, USA 47906; Department of Computer Science, Purdue University, West Lafayette, USA 47906

  • Venue: Information Retrieval
  • Year: 2008

Abstract

Annotating the functions of genes and proteins is an essential step in genome analysis. Information extraction techniques have been proposed to obtain function information for genes and proteins from the biomedical literature. However, most of these techniques perform unsatisfactorily on function annotation because concepts are expressed with large variability in the biomedical literature. This paper proposes a framework that improves literature-based gene function annotation by considering both the textual information in the literature and the functions of genes whose sequences are similar to that of a target gene. The framework collects three types of evidence: (i) textual information about gene functions, obtained by matching keywords of the gene functions; (ii) gene function information from the known functions of genes with sequences similar to the target gene; and (iii) the prior probability that a gene function is associated with an arbitrary gene, estimated by mining known gene functions from curated databases. A supervised learning method is used to obtain the weights for combining the three types of evidence and assigning appropriate Gene Ontology terms to target genes. Empirical studies on two testbeds demonstrate that combining sequence similarity scores, function prior probabilities, and textual information improves the accuracy of gene function annotation in the literature. The F-measure scores obtained with the proposed framework are substantially higher than those of solutions in prior research.
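The combination step described in the abstract can be illustrated with a minimal sketch. It assumes the three types of evidence are already available as numeric scores per (gene, GO term) pair and that the supervised learner is a plain logistic regression. The abstract does not name the learner, so that choice, the function names, and every number in the toy data are assumptions made purely for illustration, not the authors' method or results.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_weights(features, labels, lr=0.5, epochs=2000):
    """Fit logistic-regression weights by batch gradient descent.

    Each feature vector is (text_score, sequence_score, prior_probability).
    Logistic regression is an assumed stand-in for the supervised method.
    """
    n = len(features)
    n_features = len(features[0])
    weights = [0.0] * n_features
    bias = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for x, y in zip(features, labels):
            p = sigmoid(bias + sum(w * xi for w, xi in zip(weights, x)))
            err = p - y
            for j in range(n_features):
                grad_w[j] += err * x[j]
            grad_b += err
        weights = [w - lr * g / n for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / n
    return weights, bias

def combined_score(x, weights, bias):
    """Estimated probability that a candidate GO term applies to the gene."""
    return sigmoid(bias + sum(w * xi for w, xi in zip(weights, x)))

def f_measure(predicted, gold):
    """Balanced F-measure between predicted and gold sets of GO terms."""
    true_positives = len(predicted & gold)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)

# Invented toy training data: (text, sequence, prior) -> 1 if the term is correct.
train_x = [
    (0.9, 0.8, 0.3), (0.7, 0.9, 0.2), (0.8, 0.6, 0.4), (0.6, 0.7, 0.5),
    (0.2, 0.1, 0.1), (0.3, 0.2, 0.3), (0.1, 0.3, 0.2), (0.4, 0.1, 0.1),
]
train_y = [1, 1, 1, 1, 0, 0, 0, 0]

weights, bias = train_weights(train_x, train_y)

strong_candidate = combined_score((0.9, 0.9, 0.3), weights, bias)
weak_candidate = combined_score((0.1, 0.1, 0.1), weights, bias)

# Illustrative evaluation on made-up prediction and gold annotation sets.
example_f = f_measure({"GO:0003677", "GO:0005515"}, {"GO:0003677", "GO:0006355"})
```

In this sketch a candidate term would be assigned to the gene when `combined_score` exceeds a chosen threshold such as 0.5, and `f_measure` compares the resulting set of terms with the curated annotations for that gene.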