Combining gene sequence similarity and textual information for gene function annotation in the literature

Authors:
Luo Si;Danni Yu;Daisuke Kihara;Yi Fang
Affiliations:
Department of Computer Science and Statistics, Purdue University, West Lafayette, USA 47906;Department of Statistics, Purdue University, West Lafayette, USA 47906;Department of Biology and Computer Science, Purdue University, West Lafayette, USA 47906;Department of Computer Science, Purdue University, West Lafayette, USA 47906
Venue:
Information Retrieval
Year:
2008

Abstract

Annotating the functions of genes and proteins is an essential step in genome analysis. Information extraction techniques have been proposed to obtain function information about genes and proteins from the biomedical literature. However, most of these techniques perform unsatisfactorily on function annotation because concepts are expressed in highly variable ways in the biomedical literature. This paper proposes a framework that improves gene function annotation in the literature by considering both the textual information in the literature and the functions of genes whose sequences are similar to that of a target gene. The framework collects three types of evidence: (i) textual information about gene functions, obtained by matching keywords of the gene functions; (ii) gene function information from the known functions of genes with sequences similar to the target gene; and (iii) the prior probability that a gene function is associated with an arbitrary gene, obtained by mining known gene functions from curated databases. A supervised learning method learns the weights for combining the three types of evidence in order to assign appropriate Gene Ontology terms to target genes. Empirical studies on two testbeds demonstrate that combining sequence similarity scores, function prior probabilities, and textual information improves the accuracy of gene function annotation in the literature. The F-measure scores obtained with the proposed framework are substantially higher than those of solutions from prior research.
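The weighted combination of the three evidence types can be illustrated with a small sketch. The abstract does not name the supervised learner, so plain logistic regression trained by gradient descent is used here purely as a stand-in. The function names (`train_weights`, `score`), the feature order, and all numeric values are hypothetical and invented for illustration; they are not data or results from the paper.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def score(weights, bias, features):
    """Combine evidence scores for one (gene, GO term) pair into a value in (0, 1)."""
    return sigmoid(bias + sum(w * x for w, x in zip(weights, features)))

def train_weights(examples, labels, lr=0.5, epochs=2000):
    """Learn combination weights by batch gradient descent on the logistic loss.

    Assumption: logistic regression stands in for the unspecified learner.
    examples: list of (text_score, sequence_score, prior) tuples
    labels:   1 if the GO term is a correct annotation for the gene, else 0
    """
    n_features = len(examples[0])
    weights = [0.0] * n_features
    bias = 0.0
    m = len(examples)
    for _ in range(epochs):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for x, y in zip(examples, labels):
            err = score(weights, bias, x) - y
            for j, xj in enumerate(x):
                grad_w[j] += err * xj
            grad_b += err
        weights = [w - lr * g / m for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / m
    return weights, bias

# Toy training pairs (made-up values): keyword-match score, sequence-similarity
# score, and prior probability of the GO term.
examples = [
    (0.9, 0.8, 0.10),
    (0.7, 0.9, 0.05),
    (0.8, 0.2, 0.20),
    (0.1, 0.1, 0.02),
    (0.2, 0.3, 0.01),
    (0.3, 0.1, 0.15),
]
labels = [1, 1, 1, 0, 0, 0]

weights, bias = train_weights(examples, labels)

# Rank hypothetical candidate GO terms for a target gene by combined score.
candidates = {
    "GO:term_a": (0.85, 0.85, 0.10),
    "GO:term_b": (0.10, 0.10, 0.02),
}
ranked = sorted(candidates, key=lambda t: score(weights, bias, candidates[t]), reverse=True)
print(ranked)
```

In a sketch like this, the learned weights indicate how much each evidence type contributes to the final decision, and a threshold on the combined score (or a top-k cutoff) would decide which Gene Ontology terms are assigned.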