Gene name extraction using FlyBase resources

Authors:
Alex Morgan;Lynette Hirschman;Alexander Yeh;Marc Colosimo
Affiliations:
The MITRE Corporation, Bedford, MA;The MITRE Corporation, Bedford, MA;The MITRE Corporation, Bedford, MA;The MITRE Corporation, Bedford, MA
Venue:
BioMed '03 Proceedings of the ACL 2003 workshop on Natural language processing in biomedicine - Volume 13
Year:
2003

Abstract

Machine-learning-based entity extraction requires a large corpus of annotated training data to achieve acceptable results. However, the cost of expert annotation of relevant data, coupled with issues of inter-annotator variability, makes it expensive and time-consuming to create the necessary corpora. We report here on a simple method for the automatic creation of large quantities of imperfect training data for a biological entity (gene or protein) extraction system. We used resources available in the FlyBase model organism database; these resources include curated lists of genes and the articles from which the entries were drawn, together with a synonym lexicon. We applied simple pattern matching to identify gene names in the associated abstracts and filtered these entities using the list of curated entries for the article. This process created a data set that could be used to train a simple Hidden Markov Model (HMM) entity tagger. The results from the HMM tagger were comparable to those reported by other groups (F-measure of 0.75). This method has the advantage of being rapidly transferable to new domains that have similar existing resources.
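
The match-then-filter step described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the dictionary layout of the synonym lexicon, the placeholder gene identifiers (G1, G2, G3), and the choice of case-sensitive whole-token matching are all assumptions made for illustration only.

```python
import re

def build_pattern(lexicon):
    """Compile one regex that matches any synonym as a whole token.

    `lexicon` maps a synonym string to a gene identifier (hypothetical layout).
    Longer synonyms are tried first so "white" is preferred over "w".
    """
    terms = sorted(lexicon, key=len, reverse=True)
    alternation = "|".join(re.escape(t) for t in terms)
    # Case-sensitive matching is an assumption of this sketch.
    return re.compile(r"(?<!\w)(?:" + alternation + r")(?!\w)")

def tag_abstract(abstract, lexicon, curated_ids):
    """Return (start, end, text, gene_id) for each lexicon match whose gene
    is on the curated list for this article; all other matches are dropped."""
    if not lexicon:
        return []
    spans = []
    for m in build_pattern(lexicon).finditer(abstract):
        gene_id = lexicon[m.group(0)]
        if gene_id in curated_ids:  # filter by the article's curated entries
            spans.append((m.start(), m.end(), m.group(0), gene_id))
    return spans

# Toy data: identifiers are placeholders, not real FlyBase IDs.
lexicon = {"white": "G1", "w": "G1", "hedgehog": "G2", "hh": "G2", "not": "G3"}
curated_ids = {"G1", "G2"}  # genes curated for this (hypothetical) article
abstract = "The hh gene is not required for white expression."

spans = tag_abstract(abstract, lexicon, curated_ids)
for start, end, text, gene_id in spans:
    print(start, end, text, gene_id)
```

In this toy example the word "not" appears in the lexicon as a gene synonym, but its gene is not on the article's curated list, so the match is discarded. This is the kind of spurious match the curated-list filter is meant to remove, at the cost of leaving the resulting training data imperfect.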