Creating a test corpus of clinical notes manually tagged for part-of-speech information

Authors:
Serguei Pakhomov; Anni Coden; Christopher Chute
Affiliations:
Mayo Clinic, Rochester, MN; IBM, T.J. Watson Research Center, Hawthorne, NY; Mayo Clinic, Rochester, MN
Venue:
JNLPBA '04 Proceedings of the International Joint Workshop on Natural Language Processing in Biomedicine and its Applications
Year:
2004

Abstract

This paper presents a project whose main goal is to construct a corpus of clinical text manually annotated for part-of-speech (POS) information. We describe and discuss the process of training three domain experts to perform linguistic annotation. We report some of the challenges encountered, as well as encouraging results on inter-rater agreement and consistency of annotation. We also present preliminary experimental results indicating that state-of-the-art POS taggers need to be adapted to the sublanguage domain of medical text.
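
The abstract does not say how inter-rater agreement was measured. As an illustration only, the sketch below computes Cohen's kappa, one common chance-corrected agreement measure, for two annotators who tagged the same tokens. The function name `cohen_kappa` and the toy tag sequences are hypothetical and are not taken from the paper; with three annotators, a pairwise measure like this would have to be applied to each pair of annotators or replaced by a multi-rater statistic.

```python
from collections import Counter


def cohen_kappa(tags_a, tags_b):
    """Chance-corrected agreement between two annotators' tag sequences."""
    if len(tags_a) != len(tags_b) or not tags_a:
        raise ValueError("tag sequences must be non-empty and of equal length")
    n = len(tags_a)

    # Observed agreement: fraction of tokens given the same tag by both.
    observed = sum(a == b for a, b in zip(tags_a, tags_b)) / n

    # Expected agreement if each annotator assigned tags independently,
    # following their own observed tag distribution.
    freq_a = Counter(tags_a)
    freq_b = Counter(tags_b)
    expected = sum(freq_a[t] * freq_b[t] for t in freq_a) / (n * n)

    if expected == 1.0:
        # Both annotators used a single identical tag throughout;
        # kappa is undefined, so report perfect agreement by convention.
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Toy data (hypothetical): two annotators tagging the same four tokens.
annotator_1 = ["NN", "VB", "NN", "JJ"]
annotator_2 = ["NN", "VB", "JJ", "JJ"]
kappa = cohen_kappa(annotator_1, annotator_2)
print(round(kappa, 3))
```

Raw percentage agreement overstates consistency when a few tags dominate the corpus, which is why a chance-corrected measure is often preferred for annotation studies.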