High-performance tagging on medical texts

  • Authors:
  • Udo Hahn; Joachim Wermter

  • Affiliations:
  • Friedrich-Schiller-Universität Jena, Jena, Germany; Friedrich-Schiller-Universität Jena, Jena, Germany

  • Venue:
  • COLING '04 Proceedings of the 20th international conference on Computational Linguistics
  • Year:
  • 2004

Abstract

We ran both Brill's rule-based tagger and TNT, a statistical tagger, on a medical text corpus, each with a default German newspaper-language model. Supplied with limited lexicon resources, TNT outperforms the Brill tagger and reaches state-of-the-art performance figures (close to 97% accuracy). We then trained TNT on a large annotated medical text corpus, using a slightly extended tagset that captures certain particularities of medical language, and achieved 98% tagging accuracy. Hence, statistical off-the-shelf POS taggers can not only be reused immediately for medical NLP; when trained on medical corpora, they also achieve a higher performance level than on the newspaper genre.
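
The accuracy figures quoted above are most naturally read as token-level tagging accuracy: the share of tokens whose predicted part-of-speech tag matches the gold-standard annotation. The following Python sketch shows one way such a figure could be computed. It is an illustration only, not the authors' evaluation code; the function name `tagging_accuracy`, the example tokens, and the STTS-style tags are assumptions made for this example and are not taken from the paper.

```python
def tagging_accuracy(gold, predicted):
    """Return the fraction of tokens whose predicted tag equals the gold tag.

    Both arguments are lists of sentences; each sentence is a list of
    (token, tag) pairs. The two inputs must be aligned token by token.
    """
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted differ in number of sentences")

    correct = 0
    total = 0
    for gold_sent, pred_sent in zip(gold, predicted):
        if len(gold_sent) != len(pred_sent):
            raise ValueError("a sentence pair differs in number of tokens")
        for (_, gold_tag), (_, pred_tag) in zip(gold_sent, pred_sent):
            total += 1
            if gold_tag == pred_tag:
                correct += 1

    # An empty corpus has no defined accuracy; 0.0 is returned by convention here.
    return correct / total if total else 0.0


# Hypothetical example data (invented for illustration, not from the paper's corpus).
gold = [[("Der", "ART"), ("Patient", "NN"), ("hat", "VAFIN"), ("Fieber", "NN")]]
predicted = [[("Der", "ART"), ("Patient", "NN"), ("hat", "VVFIN"), ("Fieber", "NN")]]

accuracy = tagging_accuracy(gold, predicted)
print(f"{accuracy:.2%}")
```

With three of the four tokens tagged correctly, the sketch reports 75% accuracy. Under this reading, the 97% and 98% figures in the abstract would be the same ratio taken over a much larger annotated test set.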