Annotating the functions of genes and proteins is an essential step in genome analysis. Information extraction techniques have been proposed to obtain function information for genes and proteins from the biomedical literature. However, most of these techniques perform unsatisfactorily on function annotation because concepts are expressed in highly variable ways in the biomedical literature. This paper proposes a framework that improves literature-based gene function annotation by considering both the textual information in the literature and the functions of genes whose sequences are similar to that of a target gene. The framework collects three types of evidence: (i) textual information about gene functions, obtained by matching keywords of the gene functions; (ii) gene function information from the known functions of genes with sequences similar to the target gene; and (iii) the prior probability that a gene function is associated with an arbitrary gene, estimated by mining known gene functions from curated databases. A supervised learning method is used to obtain the weights for combining the three types of evidence in order to assign appropriate Gene Ontology terms to target genes. Empirical studies on two testbeds demonstrate that combining sequence similarity scores, function prior probabilities, and textual information improves the accuracy of literature-based gene function annotation. The F-measure scores obtained with the proposed framework are substantially higher than those of solutions in prior research.
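The evidence combination described above can be illustrated with a minimal sketch. The abstract does not name the supervised learning method, so logistic regression trained by gradient descent is used here purely as an assumption. The function names (`train_weights`, `score`), the feature layout, and all numeric values are hypothetical toy inputs, not data or results from the paper.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_weights(features, labels, lr=0.5, epochs=2000):
    """Fit logistic-regression weights and a bias by batch gradient descent.

    features: one [text_score, sequence_score, prior] row per (gene, GO term) pair
    labels:   1 if the GO term is a correct annotation for the gene, else 0
    NOTE: logistic regression is an assumed stand-in for the unspecified
    supervised learning method.
    """
    n_features = len(features[0])
    weights = [0.0] * n_features
    bias = 0.0
    m = len(features)
    for _ in range(epochs):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for x, y in zip(features, labels):
            pred = sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)
            error = pred - y
            for j, xi in enumerate(x):
                grad_w[j] += error * xi
            grad_b += error
        weights = [w - lr * g / m for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / m
    return weights, bias


def score(x, weights, bias):
    """Combined confidence that a GO term applies to the target gene."""
    return sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)


# Hypothetical toy training pairs: [text match, sequence similarity, prior].
train_X = [
    [0.9, 0.8, 0.10],
    [0.7, 0.9, 0.05],
    [0.8, 0.6, 0.20],
    [0.2, 0.1, 0.05],
    [0.1, 0.3, 0.10],
    [0.3, 0.2, 0.02],
]
train_y = [1, 1, 1, 0, 0, 0]

weights, bias = train_weights(train_X, train_y)

# Rank two hypothetical candidate GO terms for one target gene.
strong_candidate = score([0.85, 0.75, 0.10], weights, bias)
weak_candidate = score([0.15, 0.20, 0.10], weights, bias)
print(strong_candidate > weak_candidate)
```

In a setup like this, each candidate Gene Ontology term for a target gene would be scored from its three evidence values, and terms whose combined score exceeds a chosen threshold would be assigned. The threshold and the way each evidence value is computed are left open here because the abstract does not specify them.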