Gene name ambiguity of eukaryotic nomenclatures

Authors:
Lifeng Chen;Hongfang Liu;Carol Friedman
Affiliations:
Department of BioMedical Informatics, Columbia University New York, NY 10032, USA;Department of Information Systems, University of Maryland Baltimore County, Baltimore, MD 21250, USA;Department of BioMedical Informatics, Columbia University New York, NY 10032, USA
Venue:
Bioinformatics
Year:
2005

Citing 0
Cited 18

Biomimetic design through natural language analysis to facilitate cross-domain information retrieval
Artificial Intelligence for Engineering Design, Analysis and Manufacturing

Natural language processing and visualization in the molecular imaging domain
Journal of Biomedical Informatics

Evaluation of techniques for increasing recall in a dictionary approach to gene and protein name identification
Journal of Biomedical Informatics

Methodological Review: Extracting interactions between proteins from the literature
Journal of Biomedical Informatics

Knowledge-based gene symbol disambiguation
Proceedings of the 2nd international workshop on Data and text mining in bioinformatics

Rule-Based Protein Term Identification with Help from Automatic Species Tagging
CICLing '07 Proceedings of the 8th International Conference on Computational Linguistics and Intelligent Text Processing

@Note: A workbench for Biomedical Text Mining
Journal of Biomedical Informatics

Species disambiguation for biomedical term identification
BioNLP '08 Proceedings of the Workshop on Current Trends in Biomedical Natural Language Processing

Combining multiple evidence for gene symbol disambiguation
BioNLP '07 Proceedings of the Workshop on BioNLP 2007: Biological, Translational, and Clinical Language Processing

Annotation and disambiguation of semantic types in biomedical text: a cascaded approach to named entity recognition
NLPXML '06 Proceedings of the 5th Workshop on NLP and XML: Multi-Dimensional Markup in Natural Language Processing

Unsupervised gene/protein named entity normalization using automatically extracted dictionaries
ISMB '05 Proceedings of the ACL-ISMB Workshop on Linking Biological Literature, Ontologies and Databases: Mining Biological Semantics

Classifying relations for biomedical named entity disambiguation
EMNLP '09 Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing: Volume 3

Biomedical information retrieval: the BioTracer approach
ITBAM'10 Proceedings of the First international conference on Information technology in bio- and medical informatics

Disambiguation in the biomedical domain: The role of ambiguity type
Journal of Biomedical Informatics

EVEX: a pubmed-scale resource for homology-based generalization of text mining predictions
BioNLP '11 Proceedings of BioNLP 2011 Workshop

Supporting biomedical information retrieval: the bioTracer approach
Transactions on large-scale data- and knowledge-centered systems IV

Efficient classification method for complex biological literature using text and data mining combination
IDEAL'06 Proceedings of the 7th international conference on Intelligent Data Engineering and Automated Learning

ProNormz - An integrated approach for human proteins and protein kinases normalization
Journal of Biomedical Informatics

Quantified Score

Hi-index	3.84

Abstract

Motivation: With more and more scientific literature published online, the effective management and reuse of this knowledge has become problematic. Natural language processing (NLP) may be a potential solution by extracting, structuring and organizing biomedical information in online literature in a timely manner. One essential task is to recognize and identify genomic entities in text. 'Recognition' can be accomplished using pattern matching and machine learning. But for 'identification' these techniques are not adequate. In order to identify genomic entities, NLP needs a comprehensive resource that specifies and classifies genomic entities as they occur in text and that associates them with normalized terms and also unique identifiers so that the extracted entities are well defined. Online organism databases are an excellent resource to create such a lexical resource. However, gene name ambiguity is a serious problem because it affects the appropriate identification of gene entities. In this paper, we explore the extent of the problem and suggest ways to address it.

Results: We obtained gene information from 21 organisms and quantified naming ambiguities within species, across species, with English words and with medical terms. When the case (of letters) was retained, official symbols displayed negligible intra-species ambiguity (0.02%) and modest ambiguities with general English words (0.57%) and medical terms (1.01%). In contrast, the across-species ambiguity was high (14.20%). The inclusion of gene synonyms increased intra-species ambiguity substantially and full names contributed greatly to gene-medical-term ambiguity. A comprehensive lexical resource that covers gene information for the 21 organisms was then created and used to identify gene names by using a straightforward string matching program to process 45 000 abstracts associated with the mouse model organism while ignoring case and gene names that were also English words. We found that 85.1% of correctly retrieved mouse genes were ambiguous with other gene names. When gene names that were also English words were included, 233% additional 'gene' instances were retrieved, most of which were false positives. We also found that authors prefer to use synonyms (74.7%) to official symbols (17.7%) or full names (7.6%) in their publications.

Contact: lifeng.chen@dbmi.columbia.edu
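
To make the kinds of ambiguity described in the abstract concrete, the following is a minimal Python sketch, not the authors' program. It assumes a toy lexicon of (species, gene identifier, name) triples and a small set of English words, all of which are invented for illustration. The definitions used here are also assumptions: a name counts as intra-species ambiguous if it maps to more than one gene within a single species, as cross-species ambiguous if it occurs in more than one species, and as English-ambiguous if it appears in the word list. The function names `ambiguity_rates` and `match_gene_names` are hypothetical.

```python
import re
from collections import defaultdict

# Hypothetical toy lexicon (species, gene_id, name); not data from the paper.
TOY_LEXICON = [
    ("mouse", "M1", "Trp53"),
    ("mouse", "M1", "p53"),
    ("mouse", "M2", "Cat"),
    ("mouse", "M3", "p53"),   # invented second gene sharing a synonym
    ("human", "H1", "TP53"),
    ("human", "H1", "p53"),
    ("human", "H2", "CAT"),
    ("fly", "F1", "hop"),
]
# Hypothetical lower-case English word list.
ENGLISH_WORDS = {"cat", "hop"}


def ambiguity_rates(lexicon, english_words, ignore_case=False):
    """Fraction of distinct names that are ambiguous in each sense."""
    norm = str.lower if ignore_case else (lambda s: s)
    genes_by_species_name = defaultdict(set)  # (species, name) -> gene ids
    species_by_name = defaultdict(set)        # name -> species
    for species, gene_id, name in lexicon:
        key = norm(name)
        genes_by_species_name[(species, key)].add(gene_id)
        species_by_name[key].add(species)

    names = set(species_by_name)
    # Same name, same species, more than one gene.
    intra = {n for (_, n), ids in genes_by_species_name.items() if len(ids) > 1}
    # Same name in more than one species.
    cross = {n for n, sp in species_by_name.items() if len(sp) > 1}
    # Name coincides with an English word (case-sensitive unless ignore_case).
    english = {n for n in names if n in english_words}

    total = len(names)
    return {
        "intra_species": len(intra) / total,
        "cross_species": len(cross) / total,
        "english": len(english) / total,
    }


def match_gene_names(text, lexicon, english_words):
    """Case-insensitive dictionary lookup that skips names that are English words."""
    names = {name.lower() for _, _, name in lexicon} - english_words
    tokens = re.findall(r"[A-Za-z0-9-]+", text)
    return [tok for tok in tokens if tok.lower() in names]


if __name__ == "__main__":
    print(ambiguity_rates(TOY_LEXICON, ENGLISH_WORDS))
    print(ambiguity_rates(TOY_LEXICON, ENGLISH_WORDS, ignore_case=True))
    print(match_gene_names("The cat gene and p53 were studied", TOY_LEXICON, ENGLISH_WORDS))
```

On this toy data, ignoring case raises the cross-species and English-word rates because `Cat`, `CAT` and the word `cat` collapse to one key, which mirrors the direction of the effect the abstract reports when case is not retained. The tokenizer in `match_gene_names` is a deliberately simple stand-in; real gene names containing spaces or punctuation would need a more careful matcher.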