High-performance tagging on medical texts

Authors:
Udo Hahn; Joachim Wermter
Affiliations:
Friedrich-Schiller-Universität Jena, Jena, Germany; Friedrich-Schiller-Universität Jena, Jena, Germany
Venue:
COLING '04 Proceedings of the 20th international conference on Computational Linguistics
Year:
2004

Citing 6

Transformation-based error-driven learning and natural language processing: a case study in part-of-speech tagging
Computational Linguistics

Building a large annotated corpus of English: the Penn Treebank
Computational Linguistics - Special issue on using large corpora: II

TnT: a statistical part-of-speech tagger
ANLC '00 Proceedings of the sixth conference on Applied natural language processing

An annotation scheme for free word order languages
ANLC '97 Proceedings of the fifth conference on Applied natural language processing

Comparing a linguistic and a stochastic tagger
ACL '98 Proceedings of the 35th Annual Meeting of the Association for Computational Linguistics and Eighth Conference of the European Chapter of the Association for Computational Linguistics

Feature-rich part-of-speech tagging with a cyclic dependency network
NAACL '03 Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology - Volume 1

Cited 3

Characteristics and analysis of Finnish and Swedish clinical intensive care nursing narratives
Louhi '10 Proceedings of the NAACL HLT 2010 Second Louhi Workshop on Text and Data Mining of Health Documents

A token centric part-of-speech tagger for biomedical text
AIME'11 Proceedings of the 13th conference on Artificial intelligence in medicine

Annotation schemes to encode domain knowledge in medical narratives
LAW VI '12 Proceedings of the Sixth Linguistic Annotation Workshop

Abstract

We ran both Brill's rule-based tagger and TnT, a statistical tagger, with a default German newspaper-language model on a medical text corpus. Supplied with limited lexicon resources, TnT outperforms the Brill tagger and reaches state-of-the-art performance (close to 97% accuracy). We then trained TnT on a large annotated medical text corpus, using a slightly extended tagset that captures certain particularities of medical language, and achieved 98% tagging accuracy. Hence, statistical off-the-shelf POS taggers can not only be reused immediately for medical NLP but, when trained on medical corpora, also achieve a higher performance level than on the newspaper genre.