Boosting precision and recall of dictionary-based protein name recognition

  • Authors:
  • Yoshimasa Tsuruoka; Jun'ichi Tsujii

  • Affiliations:
  • University of Tokyo, Bunkyo-ku, Tokyo, Japan; University of Tokyo, Bunkyo-ku, Tokyo, Japan

  • Venue:
  • BioMed '03 Proceedings of the ACL 2003 workshop on Natural language processing in biomedicine - Volume 13
  • Year:
  • 2003

Abstract

Dictionary-based protein name recognition is the first step toward practical information extraction from biomedical documents because, unlike machine-learning-based approaches, it provides ID information for the recognized terms. However, dictionary-based approaches have two serious problems: (1) a large number of false recognitions, mainly caused by short names, and (2) low recall due to spelling variation. In this paper, we tackle the former problem by using a machine learning method to filter out false positives. We also present an approximate string searching method to alleviate the latter problem. Experimental results using the GENIA corpus show that filtering with a naive Bayes classifier greatly improves precision with a slight loss of recall, resulting in a much better F-score.
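
The two-stage idea in the abstract can be illustrated with a minimal sketch. The abstract does not specify the authors' cost function, features, or dictionary format, so everything below is an assumption made for illustration: a unit-cost Levenshtein distance stands in for the approximate string search, the context features (strings such as "next=protein") and the add-one smoothing are hypothetical choices, and the entry ID "ID001" is a placeholder rather than a real database identifier.

```python
import math
from collections import Counter


def edit_distance(a: str, b: str) -> int:
    """Unit-cost Levenshtein distance, computed one row at a time."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def find_candidates(tokens, dictionary, max_dist=1, max_len=3):
    """Return (start, end, entry_id) for every token span whose lowercased
    text is within max_dist edits of a lowercased dictionary name."""
    hits = []
    for start in range(len(tokens)):
        last = min(start + max_len, len(tokens))
        for end in range(start + 1, last + 1):
            span = " ".join(tokens[start:end]).lower()
            for name, entry_id in dictionary.items():
                if edit_distance(span, name.lower()) <= max_dist:
                    hits.append((start, end, entry_id))
    return hits


class NaiveBayesFilter:
    """Binary naive Bayes over string features (assumed feature design).
    Label True means the candidate is a real protein name."""

    def __init__(self):
        self.class_counts = Counter()
        self.feature_counts = {True: Counter(), False: Counter()}
        self.vocab = set()

    def train(self, examples):
        for features, label in examples:
            self.class_counts[label] += 1
            for f in features:
                self.feature_counts[label][f] += 1
                self.vocab.add(f)

    def log_score(self, features, label):
        # Add-one smoothing on both the class prior and the feature likelihoods.
        total = sum(self.class_counts.values())
        score = math.log((self.class_counts[label] + 1) / (total + 2))
        # One extra slot in the denominator covers features never seen in training.
        denom = sum(self.feature_counts[label].values()) + len(self.vocab) + 1
        for f in features:
            score += math.log((self.feature_counts[label][f] + 1) / denom)
        return score

    def accept(self, features):
        return self.log_score(features, True) >= self.log_score(features, False)


# Hypothetical dictionary: surface name -> placeholder entry ID.
dictionary = {"NF-kappa B": "ID001"}
tokens = ["NF", "kappa", "B", "activates", "transcription"]
candidates = find_candidates(tokens, dictionary)
```

In this sketch the span "NF kappa B" is recovered even though the dictionary spells the name with a hyphen, which is the kind of spelling variation the approximate search is meant to absorb. Each candidate would then be passed to `NaiveBayesFilter.accept` with features drawn from its context, and rejected candidates would be dropped as likely false positives. The brute-force scan over every dictionary entry is only workable for toy dictionaries; a realistic system would need an index or a more efficient search.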