A hybrid approach to protein name identification in biomedical texts

  • Authors: Kazuhiro Seki; Javed Mostafa

  • Affiliations: Laboratory for Applied Informatics Research, Indiana University, 1320 East Tenth Street, LI 011, Bloomington, Indiana (both authors)

  • Venue: Information Processing and Management: an International Journal
  • Year: 2005

Abstract

This paper presents a hybrid approach to identifying protein names in biomedical texts, a task regarded as a crucial step in text mining. Our approach employs a set of simple heuristics for the initial detection of protein names and uses a probabilistic model to locate complete protein names. A protein name dictionary is consulted as a complementary resource. In contrast to previously proposed methods, our method avoids natural language processing tools such as part-of-speech taggers and syntactic parsers and relies solely on surface clues, so as to reduce processing overhead. Moreover, we propose a framework for automatically creating a large-scale corpus annotated with protein names, which can then be used to train our probabilistic model. We implemented a protein name identification system, named PROTEX, based on the proposed method and evaluated it against a system developed by other researchers on a common test set. The experiments showed that the automatically constructed corpus is as useful for training as manually annotated corpora and that PROTEX identifies compound protein names effectively.
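
The three ingredients named in the abstract (surface-clue heuristics, a probabilistic step that extends a detected word to the complete name, and a dictionary lookup) can be illustrated with a small sketch. The abstract does not give PROTEX's actual heuristics or model, so everything below is an assumption made for illustration: the `is_core` rule, the `PART_OF_NAME_PROB` table and its values, the `DICTIONARY` entries, and the 0.5 threshold are all hypothetical.

```python
# Illustrative sketch only; not the PROTEX implementation.

# Hypothetical dictionary of known single-word protein names (lowercase).
DICTIONARY = {"insulin"}

# Hypothetical probabilities that a neighbouring word belongs to a protein
# name. A real system would estimate these from an annotated corpus.
PART_OF_NAME_PROB = {"protein": 0.6, "receptor": 0.85, "kinase": 0.9, "the": 0.01}


def is_core(token):
    """Surface clue (assumed): letters mixed with digits, or a capital after
    the first character, as in 'p53' or 'NF-kappaB'."""
    has_alpha = any(c.isalpha() for c in token)
    has_digit = any(c.isdigit() for c in token)
    inner_upper = any(c.isupper() for c in token[1:])
    return has_alpha and (has_digit or inner_upper)


def part_prob(token):
    """Look up the assumed probability; unknown words get 0.0."""
    return PART_OF_NAME_PROB.get(token.lower(), 0.0)


def identify(text, threshold=0.5):
    """Return candidate protein names found in text, in order of appearance."""
    tokens = [t.strip(".,;:()") for t in text.split()]
    tokens = [t for t in tokens if t]
    names = []
    prev_end = -1  # last token index already used by an earlier name
    i = 0
    while i < len(tokens):
        tok = tokens[i]
        if is_core(tok) or tok.lower() in DICTIONARY:
            start, end = i, i
            # Extend left, without re-using tokens from the previous name.
            while start - 1 > prev_end and part_prob(tokens[start - 1]) >= threshold:
                start -= 1
            # Extend right while neighbours are likely part of the name.
            while end + 1 < len(tokens) and part_prob(tokens[end + 1]) >= threshold:
                end += 1
            names.append(" ".join(tokens[start:end + 1]))
            prev_end = end
            i = end + 1
        else:
            i += 1
    return names


print(identify("The p53 protein binds insulin receptor in cells."))
```

With the toy tables above, the heuristic picks up `p53`, the dictionary picks up `insulin`, and the expansion step attaches `protein` and `receptor` respectively. No part-of-speech tagging or parsing is involved, which mirrors the abstract's emphasis on surface clues, though the scoring here is far simpler than a trained probabilistic model.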