Evaluation of techniques for increasing recall in a dictionary approach to gene and protein name identification

Authors:
Martijn J. Schuemie;Barend Mons;Marc Weeber;Jan A. Kors
Affiliations:
Department of Medical Informatics, Erasmus University Medical Center Rotterdam, P.O. Box 1738, 3000 DR, Rotterdam, Netherlands;Department of Medical Informatics, Erasmus University Medical Center Rotterdam, P.O. Box 1738, 3000 DR, Rotterdam, Netherlands;Department of Medical Informatics, Erasmus University Medical Center Rotterdam, P.O. Box 1738, 3000 DR, Rotterdam, Netherlands;Department of Medical Informatics, Erasmus University Medical Center Rotterdam, P.O. Box 1738, 3000 DR, Rotterdam, Netherlands
Venue:
Journal of Biomedical Informatics
Year:
2007

Abstract

Gene and protein name identification in text requires a dictionary approach to relate synonyms to the same gene or protein, and to link names to external databases. However, existing dictionaries are incomplete. We investigate two complementary methods for automatic generation of a comprehensive dictionary: combining information from existing gene and protein databases, and rule-based generation of spelling variations. Both methods have been reported in the literature before, but have not hitherto been combined and evaluated systematically. We combined gene and protein names from several existing databases of four different organisms. The combined dictionaries showed a substantial increase in recall on three different test sets compared with any single database. Applying 23 spelling variation rules to the combined dictionaries further increased recall. However, many rules appeared to have no effect, and some appeared to have a detrimental effect on precision.
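
The two methods described in the abstract can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the function names, the toy gene identifiers and names, and the two spelling rules (hyphen/space alternation and splitting a trailing number) are hypothetical and are not the 23 rules or the databases evaluated in the paper. Recall is computed here as simple case-insensitive exact matching against a list of gold-standard mentions.

```python
import re
from collections import defaultdict


def merge_dictionaries(*sources):
    """Combine several {gene_id: set_of_names} dictionaries by taking
    the union of the names recorded for each identifier."""
    merged = defaultdict(set)
    for source in sources:
        for gene_id, names in source.items():
            merged[gene_id].update(names)
    return dict(merged)


def spelling_variants(name):
    """Return the name plus variants from two illustrative rules.
    These rules are hypothetical examples, not the paper's rule set."""
    variants = {name}
    # Rule A (assumed): a hyphen may be written as a space or omitted.
    if "-" in name:
        variants.add(name.replace("-", " "))
        variants.add(name.replace("-", ""))
    # Rule B (assumed): letters directly followed by digits may be
    # separated by a hyphen or a space, e.g. "abc1" -> "abc-1", "abc 1".
    match = re.fullmatch(r"([A-Za-z]+)(\d+)", name)
    if match:
        variants.add(f"{match.group(1)}-{match.group(2)}")
        variants.add(f"{match.group(1)} {match.group(2)}")
    return variants


def expand(dictionary):
    """Apply the spelling rules to every name in the dictionary."""
    return {
        gene_id: set().union(*(spelling_variants(n) for n in names))
        for gene_id, names in dictionary.items()
    }


def recall(dictionary, gold_mentions):
    """Fraction of gold mentions found in the dictionary
    (case-insensitive exact match)."""
    known = {n.lower() for names in dictionary.values() for n in names}
    found = sum(1 for mention in gold_mentions if mention.lower() in known)
    return found / len(gold_mentions)


# Toy data: made-up identifiers and names, for illustration only.
db_a = {"G1": {"abc1"}}
db_b = {"G1": {"ABC-1 protein"}, "G2": {"xyz"}}
gold = ["abc-1", "xyz", "foo"]

combined = merge_dictionaries(db_a, db_b)
expanded = expand(combined)
```

On this toy data, `recall(db_a, gold)`, `recall(combined, gold)` and `recall(expanded, gold)` can be compared to see the direction of the effect the abstract reports: combining sources adds names, and the variation rules add further matches. The sketch does not measure precision, which, as the abstract notes, some rules may harm.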