Boosting precision and recall of dictionary-based protein name recognition

Authors:
Yoshimasa Tsuruoka;Jun'ichi Tsujii
Affiliations:
University of Tokyo, Bunkyo-ku, Tokyo, Japan;University of Tokyo, Bunkyo-ku, Tokyo, Japan
Venue:
BioMed '03 Proceedings of the ACL 2003 workshop on Natural language processing in biomedicine - Volume 13
Year:
2003

Abstract

Dictionary-based protein name recognition is the first step for practical information extraction from biomedical documents because, unlike machine-learning-based approaches, it provides ID information for recognized terms. However, dictionary-based approaches have two serious problems: (1) a large number of false recognitions, mainly caused by short names, and (2) low recall due to spelling variation. In this paper, we tackle the former problem by using a machine learning method to filter out false positives. We also present an approximate string searching method to alleviate the latter problem. Experimental results using the GENIA corpus show that filtering with a naive Bayes classifier greatly improves precision with a slight loss of recall, resulting in a much better F-score.
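
As a rough illustration of how approximate string searching can recover spelling variants that exact dictionary lookup would miss, the following is a minimal sketch, not the authors' method. It assumes unit-cost Levenshtein distance and a brute-force scan over the dictionary; the abstract does not specify the paper's cost function or search algorithm. The function names, the toy dictionary, and its IDs are hypothetical.

```python
from typing import Dict, List, Tuple


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance with unit costs (an assumption for this sketch)."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def lookup(candidate: str,
           dictionary: Dict[str, str],
           max_dist: int = 1) -> List[Tuple[str, str, int]]:
    """Return (name, id, distance) for entries within max_dist edits.

    Comparison is case-insensitive; results are sorted by distance.
    """
    matches = []
    for name, protein_id in dictionary.items():
        d = edit_distance(candidate.lower(), name.lower())
        if d <= max_dist:
            matches.append((name, protein_id, d))
    return sorted(matches, key=lambda m: m[2])


# Hypothetical toy dictionary: names mapped to made-up IDs.
TOY_DICTIONARY = {
    "NF-kappa B": "ID_001",
    "interleukin-2": "ID_002",
}

if __name__ == "__main__":
    # "NF kappa B" differs from the entry "NF-kappa B" by one substitution.
    print(lookup("NF kappa B", TOY_DICTIONARY))
    # "insulin" is far from both entries, so nothing is returned.
    print(lookup("insulin", TOY_DICTIONARY))
```

A looser `max_dist` raises recall but also admits more spurious matches, especially for short names. Those false positives are what the filtering step described in the abstract is meant to remove.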