Human gene name normalization using text matching with automatically extracted synonym dictionaries

Authors:
Haw-ren Fang;Kevin Murphy;Yang Jin;Jessica S. Kim;Peter S. White
Affiliations:
University of Maryland, College Park, MD;Children's Hospital of Philadelphia, Philadelphia, PA;Children's Hospital of Philadelphia, Philadelphia, PA;Children's Hospital of Philadelphia, Philadelphia, PA;Children's Hospital of Philadelphia, Philadelphia, PA
Venue:
BioNLP '06 Proceedings of the Workshop on Linking Natural Language Processing and Biology: Towards Deeper Biological Literature Analysis
Year:
2006

Citing (5):
Gene name identification and normalization using a model organism database
    Journal of Biomedical Informatics - Special issue: Named entity recognition in biomedicine
An entity tagger for recognizing acquired genomic variations in cancer literature
    Bioinformatics
Contrast and variability in gene names
    BioMed '02 Proceedings of the ACL-02 workshop on Natural language processing in the biomedical domain - Volume 3
Adaptive string similarity metrics for biomedical reference resolution
    ISMB '05 Proceedings of the ACL-ISMB Workshop on Linking Biological Literature, Ontologies and Databases: Mining Biological Semantics
Unsupervised gene/protein named entity normalization using automatically extracted dictionaries
    ISMB '05 Proceedings of the ACL-ISMB Workshop on Linking Biological Literature, Ontologies and Databases: Mining Biological Semantics

Cited (5):
Gene ontology annotation as text categorization: An empirical study
    Information Processing and Management: an International Journal
Rule-Based Protein Term Identification with Help from Automatic Species Tagging
    CICLing '07 Proceedings of the 8th International Conference on Computational Linguistics and Intelligent Text Processing
Investigation of unsupervised pattern learning techniques for bootstrap construction of a medical treatment lexicon
    BioNLP '09 Proceedings of the Workshop on Current Trends in Biomedical Natural Language Processing
A joint model for normalizing gene and organism mentions in text
    WBIE '09 Proceedings of the Workshop on Biomedical Information Extraction
ProNormz - An integrated approach for human proteins and protein kinases normalization
    Journal of Biomedical Informatics

Abstract

The identification of genes in biomedical text typically consists of two stages: identifying gene mentions and normalizing gene names. We have created an automated process that takes the gene mentions output by named entity recognition (NER) systems and normalizes them to standard referents. The system extracts human gene synonyms from online databases to generate an extensive synonym lexicon. The lexicon is then compared against a list of candidate gene mentions using string transformations that can be applied and chained in a flexible order, followed by exact or approximate string matching. On a gold standard of MEDLINE abstracts manually tagged and normalized for mentions of human genes, a combined tagging and normalization system achieved 0.669 F-measure (0.718 precision and 0.626 recall) at the mention level, and 0.901 F-measure (0.957 precision and 0.857 recall) at the document level for documents used for tagger training.
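
The matching step described in the abstract (chained string transformations, then exact matching with an approximate fallback) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the specific transformations, the toy lexicon with made-up identifiers `G1` and `G2`, the 0.85 similarity cutoff, and the use of Python's `difflib` for approximate matching are all assumptions chosen for the example.

```python
import difflib
import re
from typing import Callable, Dict, Iterable, List, Set

Transform = Callable[[str], str]


def lowercase(s: str) -> str:
    return s.lower()


def strip_punctuation(s: str) -> str:
    # Replace every character that is neither a word character nor whitespace.
    return re.sub(r"[^\w\s]", " ", s)


def collapse_whitespace(s: str) -> str:
    return " ".join(s.split())


def chain(transforms: Iterable[Transform]) -> Transform:
    """Compose transformations so they are applied in the given order."""
    steps = list(transforms)

    def apply(s: str) -> str:
        for step in steps:
            s = step(s)
        return s

    return apply


def build_index(lexicon: Dict[str, List[str]], normalize: Transform) -> Dict[str, Set[str]]:
    """Map each transformed synonym to the set of gene identifiers that use it."""
    index: Dict[str, Set[str]] = {}
    for gene_id, synonyms in lexicon.items():
        for synonym in synonyms:
            index.setdefault(normalize(synonym), set()).add(gene_id)
    return index


def normalize_mention(mention: str, index: Dict[str, Set[str]],
                      normalize: Transform, cutoff: float = 0.85) -> Set[str]:
    """Try exact matching first, then fall back to approximate matching."""
    key = normalize(mention)
    if key in index:
        return index[key]
    close = difflib.get_close_matches(key, list(index), n=1, cutoff=cutoff)
    return index[close[0]] if close else set()


# Toy lexicon with hypothetical identifiers (illustrative only).
LEXICON = {
    "G1": ["TP53", "p53", "tumor protein p53"],
    "G2": ["BRCA1", "breast cancer 1"],
}

normalize = chain([lowercase, strip_punctuation, collapse_whitespace])
index = build_index(LEXICON, normalize)

for mention in ["Tumor  Protein p53", "TP-53", "unrelated term"]:
    print(mention, "->", sorted(normalize_mention(mention, index, normalize)))
```

Because the transformations are plain functions passed as a list, reordering or swapping them only requires changing the argument to `chain`, which mirrors the "flexible order" mentioned in the abstract. A mention that maps to more than one identifier would still need a disambiguation step, which this sketch leaves out.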