A Probabilistic Model for Identifying Protein Names and their Name Boundaries

  • Authors:
  • Kazuhiro Seki;Javed Mostafa

  • Venue:
  • CSB '03 Proceedings of the IEEE Computer Society Conference on Bioinformatics
  • Year:
  • 2003

Abstract

This paper proposes a method for identifying protein names in biomedical texts with an emphasis on detecting protein name boundaries. We use a probabilistic model which exploits several surface clues characterizing protein names and incorporates word classes for generalization. In contrast to previously proposed methods, our approach does not rely on natural language processing tools such as part-of-speech taggers and syntactic parsers, so as to reduce processing overhead and the potential number of probabilistic parameters to be estimated. A notion of certainty is also proposed to improve precision for identification. We implemented a protein name identification system based on our proposed method, and evaluated the system on real-world biomedical texts in conjunction with the previous work. The results showed that overall our system performs comparably to the state-of-the-art protein name identification system and that higher performance is achieved for compound names. In addition, it is demonstrated that our system can further improve precision by restricting the system output to those names with high certainties.