Gene name identification and normalization using a model organism database

Authors:
Alexander A. Morgan;Lynette Hirschman;Marc Colosimo;Alexander S. Yeh;Jeff B. Colombe
Affiliations:
MITRE Corporation, 202 Burlington Road, Mail Stop K325, Bedford, MA and Department of Biology, Tufts University, Medford, MA;MITRE Corporation, 202 Burlington Road, Mail Stop K325, Bedford, MA;MITRE Corporation, 202 Burlington Road, Mail Stop K325, Bedford, MA;MITRE Corporation, 202 Burlington Road, Mail Stop K325, Bedford, MA;MITRE Corporation, 202 Burlington Road, Mail Stop K325, Bedford, MA
Venue:
Journal of Biomedical Informatics - Special issue: Named entity recognition in biomedicine
Year:
2004

Abstract

Biology has now become an information science, and researchers are increasingly dependent on expert-curated biological databases to organize the findings from the published literature. We report here on a series of experiments related to the application of natural language processing to aid in the curation process for FlyBase. We focused on listing the normalized form of genes and gene products discussed in an article. We broke this into two steps: gene mention tagging in text, followed by normalization of gene names. For gene mention tagging, we adopted a statistical approach. To provide training data, we were able to reverse engineer the gene lists from the associated articles and abstracts, to generate text labeled (imperfectly) with gene mentions. We then evaluated the quality of the noisy training data (precision of 78%, recall 88%) and the quality of the HMM tagger output trained on this noisy data (precision 78%, recall 71%). In order to generate normalized gene lists, we explored two approaches. First, we explored simple pattern matching based on synonym lists to obtain a high recall/low precision system (recall 95%, precision 2%). Using a series of filters, we were able to improve precision to 50% with a recall of 72% (balanced F-measure of 0.59). Our second approach combined the HMM gene mention tagger with various filters to remove ambiguous mentions; this approach achieved an F-measure of 0.72 (precision 88%, recall 61%). These experiments indicate that the lexical resources provided by FlyBase are complete enough to achieve high recall on the gene list task, and that normalization requires accurate disambiguation; different strategies for tagging and normalization trade off recall for precision.
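
The synonym-list approach and the recall/precision trade-off described in the abstract can be illustrated with a minimal sketch. This is not the authors' system: the gene identifiers, synonyms, example sentence, and function names below are all hypothetical, and the single "drop ambiguous synonyms" filter only stands in for the series of filters the abstract mentions.

```python
import re

# Hypothetical synonym lists: normalized gene identifier -> surface forms.
# Identifiers and synonyms are invented for illustration, not FlyBase records.
SYNONYMS = {
    "GENE:0001": ["white", "w"],
    "GENE:0002": ["hedgehog", "hh"],
    "GENE:0003": ["wingless", "wg", "w"],
}


def build_lexicon(synonyms, drop_ambiguous=False):
    """Invert the synonym lists into surface form -> set of gene identifiers."""
    lexicon = {}
    for gene_id, names in synonyms.items():
        for name in names:
            lexicon.setdefault(name.lower(), set()).add(gene_id)
    if drop_ambiguous:
        # Filter: discard surface forms that map to more than one gene.
        lexicon = {name: ids for name, ids in lexicon.items() if len(ids) == 1}
    return lexicon


def normalized_gene_list(text, lexicon):
    """Return the identifiers of all genes whose synonyms appear as tokens."""
    found = set()
    for token in re.findall(r"[a-z0-9]+", text.lower()):
        found |= lexicon.get(token, set())
    return found


def balanced_f(precision, recall):
    """Balanced F-measure: harmonic mean of precision and recall."""
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0


def score(predicted, gold):
    """Precision, recall, and balanced F-measure of a predicted gene list."""
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall, balanced_f(precision, recall)


text = "Expression of hh was reduced in w mutants."
gold = {"GENE:0001", "GENE:0002"}  # hypothetical curated gene list

loose = normalized_gene_list(text, build_lexicon(SYNONYMS))
strict = normalized_gene_list(text, build_lexicon(SYNONYMS, drop_ambiguous=True))

print("unfiltered:", sorted(loose), score(loose, gold))
print("filtered:  ", sorted(strict), score(strict, gold))
```

In this toy case the unfiltered matcher recovers both gold genes but also a spurious one through the ambiguous synonym "w", while the filtered matcher loses a gold gene in exchange for perfect precision. The same `balanced_f` formula is consistent with the figures reported in the abstract: precision 0.50 and recall 0.72 give about 0.59, and precision 0.88 and recall 0.61 give about 0.72.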