An Algorithm that Learns What‘s in a Name
Machine Learning - Special issue on natural language learning
Constructing Biological Knowledge Bases by Extracting Information from Text Sources
Proceedings of the Seventh International Conference on Intelligent Systems for Molecular Biology
Rutabaga by any other name: extracting biological names
Journal of Biomedical Informatics - Special issue: Sublanguage
Extracting the names of genes and gene products with a hidden Markov model
COLING '00 Proceedings of the 18th conference on Computational linguistics - Volume 1
Contrast and variability in gene names
BioMed '02 Proceedings of the ACL-02 workshop on Natural language processing in the biomedical domain - Volume 3
Medstract: creating large-scale information servers for biomedical libraries
BioMed '02 Proceedings of the ACL-02 workshop on Natural language processing in the biomedical domain - Volume 3
Enhancing performance of protein and gene name recognizers with filtering and integration strategies
Journal of Biomedical Informatics - Special issue: Named entity recognition in biomedicine
Term identification in the biomedical literature
Journal of Biomedical Informatics - Special issue: Named entity recognition in biomedicine
Semi-supervised anaphora resolution in biomedical texts
BioNLP '06 Proceedings of the Workshop on Linking Natural Language Processing and Biology: Towards Deeper Biological Literature Analysis
Weakly supervised learning methods for improving the quality of gene name normalization data
ISMB '05 Proceedings of the ACL-ISMB Workshop on Linking Biological Literature, Ontologies and Databases: Mining Biological Semantics
Semi-supervised anaphora resolution in biomedical texts
LNLBioNLP '06 Proceedings of the HLT-NAACL BioNLP Workshop on Linking Natural Language and Biology
Journal of Biomedical Informatics
Topic-Oriented words as features for named entity recognition
CICLing'13 Proceedings of the 14th international conference on Computational Linguistics and Intelligent Text Processing - Volume Part I
Hi-index | 0.00 |
Machine-learning based entity extraction requires a large corpus of annotated training to achieve acceptable results. However, the cost of expert annotation of relevant data, coupled with issues of inter-annotator variability, makes it expensive and time-consuming to create the necessary corpora. We report here on a simple method for the automatic creation of large quantities of imperfect training data for a biological entity (gene or protein) extraction system. We used resources available in the FlyBase model organism database; these resources include a curated lists of genes and the articles from which the entries were drawn, together a synonym lexicon. We applied simple pattern matching to identify gene names in the associated abstracts and filtered these entities using the list of curated entries for the article. This process created a data set that could be used to train a simple Hidden Markov Model (HMM) entity tagger. The results from the HMM tagger were comparable to those reported by other groups (F-measure of 0.75). This method has the advantage of being rapidly transferable to new domains that have similar existing resources.