Developing a robust part-of-speech tagger for biomedical text

  • Authors:
  • Yoshimasa Tsuruoka;Yuka Tateishi;Jin-Dong Kim;Tomoko Ohta;John McNaught;Sophia Ananiadou;Jun’ichi Tsujii

  • Affiliations:
  • CREST, JST (Japan Science and Technology Agency), Saitama, Japan;CREST, JST (Japan Science and Technology Agency), Saitama, Japan;CREST, JST (Japan Science and Technology Agency), Saitama, Japan;CREST, JST (Japan Science and Technology Agency), Saitama, Japan;School of Informatics, University of Manchester, Manchester, UK;School of Computing, Science and Engineering, Salford University, Salford, Greater Manchester, UK;University of Tokyo, Tokyo, Japan

  • Venue:
  • PCI'05 Proceedings of the 10th Panhellenic conference on Advances in Informatics
  • Year:
  • 2005

Abstract

This paper presents a part-of-speech tagger that is specifically tuned for biomedical text. We built the tagger with maximum entropy modeling and a state-of-the-art tagging algorithm. The tagger was trained on a corpus containing both newspaper articles and biomedical documents so that it would work well on various types of biomedical text. Experimental results on the Wall Street Journal corpus, the GENIA corpus, and the PennBioIE corpus show that adding training data from a different domain does not hurt the tagger's performance, and that our tagger achieves very good precision (97% to 98%) on all of these corpora. We also evaluated the robustness of the tagger using recent MEDLINE articles.
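
To make the modeling idea concrete, the following is a minimal sketch of a maximum entropy (multinomial logistic) tagger trained on a mix of newswire-style and biomedical-style sentences. It is an illustration only, not the authors' system: the abstract does not describe the feature set or the tagging algorithm, so the feature template (`features`), the toy sentences (`NEWS`, `BIO`), the hyperparameters, and the greedy per-token decoding are all assumptions made here.

```python
import math
from collections import defaultdict

def features(words, i):
    """Hypothetical feature template: current word, 3-char suffix, previous word, bias."""
    w = words[i]
    prev = words[i - 1] if i > 0 else "<s>"
    return [f"w={w.lower()}", f"suf3={w[-3:].lower()}", f"prev={prev.lower()}", "bias"]

def predict_proba(weights, tags, feats):
    """Maximum entropy model: p(tag | feats) is a softmax over summed feature weights."""
    scores = {t: sum(weights[(f, t)] for f in feats) for t in tags}
    top = max(scores.values())  # subtract the max for numerical stability
    exps = {t: math.exp(s - top) for t, s in scores.items()}
    z = sum(exps.values())
    return {t: e / z for t, e in exps.items()}

def train_maxent(sentences, epochs=200, lr=0.5):
    """Fit weights by batch gradient ascent on the conditional log-likelihood."""
    tags = sorted({t for sent in sentences for _, t in sent})
    data = []
    for sent in sentences:
        words = [w for w, _ in sent]
        for i, (_, gold) in enumerate(sent):
            data.append((features(words, i), gold))
    weights = defaultdict(float)  # keyed by (feature, tag)
    for _ in range(epochs):
        grad = defaultdict(float)
        for feats, gold in data:
            probs = predict_proba(weights, tags, feats)
            for t in tags:
                # Gradient = observed feature count minus expected feature count.
                delta = (1.0 if t == gold else 0.0) - probs[t]
                for f in feats:
                    grad[(f, t)] += delta
        for key, g in grad.items():
            weights[key] += lr * g / len(data)
    return weights, tags

def tag(weights, tags, words):
    """Greedy decoding: pick the most probable tag for each token independently."""
    result = []
    for i in range(len(words)):
        probs = predict_proba(weights, tags, features(words, i))
        result.append(max(tags, key=lambda t: probs[t]))
    return result

# Toy stand-ins for the two training domains (invented for illustration).
NEWS = [[("The", "DT"), ("company", "NN"), ("reported", "VBD"), ("profits", "NNS")]]
BIO = [[("The", "DT"), ("protein", "NN"), ("activated", "VBD"), ("genes", "NNS")]]

# Train on the union of both domains, mirroring the mixed-corpus setup.
weights, tags = train_maxent(NEWS + BIO)
print(tag(weights, tags, ["The", "protein", "activated", "genes"]))
```

A real tagger would use far richer features and a sequence-level decoding algorithm. The sketch only shows how a single maximum entropy model can be fit to pooled training data from more than one domain.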