Combining evidence, specificity, and proximity towards the normalization of Gene Ontology terms in text

Authors:
S. Gaudan;A. Jimeno Yepes;V. Lee;D. Rebholz-Schuhmann
Affiliations:
European Bioinformatics Institute, Cambridge, UK;European Bioinformatics Institute, Cambridge, UK;European Bioinformatics Institute, Cambridge, UK;European Bioinformatics Institute, Cambridge, UK
Venue:
EURASIP Journal on Bioinformatics and Systems Biology
Year:
2008



Abstract

Manual annotation of proteins with Gene Ontology (GO) concepts provides structured information that serves as a high-quality, reliable data source for the research community. However, only a limited number of proteins are annotated, because fully annotating each gene product from the literature requires substantial human effort. We introduce a novel method for the automatic identification of GO terms in natural language text. The method takes several features into consideration: (1) the evidence for a GO term given by the words occurring in the text, (2) the proximity between those words, and (3) the specificity of the GO terms based on their information content. The method has been evaluated on the BioCreAtIvE corpus and compared with current state-of-the-art methods. Precision reached 0.34 at a recall of 0.34 for the terms identified at rank 1. In our analysis, we observe that GO terms in the "cellular component" subbranch of GO are identified more accurately than terms from the other two subbranches. This observation is explained by differences in the average number of words per term across the subbranches.
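
The abstract names three features without giving their formulas, so the following is only a minimal illustrative sketch of how such features might be computed and combined. It is not the authors' implementation. Every function name, the window-based proximity measure, the corpus counts, and the multiplicative combination are assumptions made for illustration.

```python
import math


def evidence(term_words, text_words):
    """Assumed evidence measure: fraction of the term's words found in the text."""
    present = set(text_words)
    return sum(1 for w in term_words if w in present) / len(term_words)


def proximity(term_words, text_words):
    """Assumed proximity measure: number of distinct matched words divided by
    the length of the smallest text window that contains all of them."""
    wanted = set(term_words) & set(text_words)
    if not wanted:
        return 0.0
    best = len(text_words)  # the whole text is always a valid window
    for start, word in enumerate(text_words):
        if word not in wanted:
            continue
        seen = set()
        for end in range(start, len(text_words)):
            if text_words[end] in wanted:
                seen.add(text_words[end])
            if seen == wanted:
                best = min(best, end - start + 1)
                break
    return len(wanted) / best


def information_content(term_count, total_count):
    """Specificity as information content: -log of the term's relative frequency.
    The counts are hypothetical annotation frequencies."""
    return -math.log(term_count / total_count)


def score(term_words, text_words, term_count, total_count):
    """Hypothetical combination: a plain product of the three features."""
    return (evidence(term_words, text_words)
            * proximity(term_words, text_words)
            * information_content(term_count, total_count))


if __name__ == "__main__":
    term = "protein kinase activity".split()
    text = "the protein has kinase activity in vitro".split()
    # Counts (10 annotations out of 1000) are made up for the example.
    print(score(term, text, term_count=10, total_count=1000))
```

In this toy example all three words of the term occur in the sentence, so the evidence is 1.0, and the smallest window covering them spans four tokens, giving a proximity of 0.75. A rarer term (a smaller `term_count`) yields a higher information content and therefore a higher score under this assumed combination.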