A token centric part-of-speech tagger for biomedical text

Authors:
Neil Barrett; Jens Weber-Jahnke
Affiliations:
Department of Computer Science, University of Victoria, Victoria, Canada;Department of Computer Science, University of Victoria, Victoria, Canada
Venue:
AIME'11 Proceedings of the 13th conference on Artificial intelligence in medicine
Year:
2011

Abstract

A difficulty with part-of-speech (POS) tagging of biomedical text is accessing and annotating appropriate training corpora. As a result, POS taggers may be trained on corpora that differ from their target biomedical text, and tagging accuracy decreases when training and target corpora differ. We present a POS tagger that is more accurate than two frequently used biomedical POS taggers (Brill and TnT) when trained on a non-biomedical corpus and evaluated on the MedPost corpus (our tagger: 81.0%, Brill: 77.5%, TnT: 78.2%). Our tagger is also significantly faster than the next best tagger (TnT). It estimates a tag's likelihood for a token by combining prior probabilities (using existing methods) and token probabilities calculated in part using a Naive Bayes classifier. Our results suggest that future work should reexamine POS tagging methods for biomedical text, in contrast to work to date, which has focused on retraining existing POS taggers.
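
The abstract describes scoring each candidate tag by combining a prior probability with a token probability from a Naive Bayes classifier. The sketch below is one possible reading of that idea, not the authors' implementation. Everything in it is an assumption made for illustration: the class name TokenCentricSketch, the three token features (lowercased suffix, capitalization, digit presence), the use of a previous-tag transition count as the prior, add-one smoothing, and greedy left-to-right decoding.

```python
import math
from collections import Counter, defaultdict


def features(token):
    # Hypothetical token features; the abstract does not say which ones are used.
    return [
        "suffix=" + token[-3:].lower(),
        "capitalized=" + str(token[:1].isupper()),
        "has_digit=" + str(any(c.isdigit() for c in token)),
    ]


class TokenCentricSketch:
    """Illustrative tagger: score(tag) = log prior(tag | previous tag)
    + Naive Bayes log likelihood of the token's features given the tag."""

    def __init__(self, tagged_sentences, alpha=1.0):
        self.alpha = alpha  # add-alpha smoothing constant (assumed)
        self.tag_counts = Counter()
        self.transitions = defaultdict(Counter)     # previous tag -> tag counts
        self.feature_counts = defaultdict(Counter)  # tag -> feature counts
        self.vocab = set()
        for sentence in tagged_sentences:
            prev = "<s>"  # sentence-start marker
            for token, tag in sentence:
                self.tag_counts[tag] += 1
                self.transitions[prev][tag] += 1
                for f in features(token):
                    self.feature_counts[tag][f] += 1
                    self.vocab.add(f)
                prev = tag
        self.tags = sorted(self.tag_counts)

    def log_prior(self, tag, prev):
        # Smoothed estimate of P(tag | previous tag).
        counts = self.transitions[prev]
        total = sum(counts.values())
        return math.log(
            (counts[tag] + self.alpha) / (total + self.alpha * len(self.tags))
        )

    def log_token(self, tag, token):
        # Naive Bayes: features are treated as independent given the tag.
        total = sum(self.feature_counts[tag].values())
        denom = total + self.alpha * len(self.vocab)
        return sum(
            math.log((self.feature_counts[tag][f] + self.alpha) / denom)
            for f in features(token)
        )

    def tag(self, tokens):
        # Greedy decoding: pick the best tag for each token in turn.
        prev, output = "<s>", []
        for token in tokens:
            best = max(
                self.tags,
                key=lambda t: self.log_prior(t, prev) + self.log_token(t, token),
            )
            output.append(best)
            prev = best
        return output


# Toy training data with Penn Treebank style tags, invented for this example.
train = [
    [("The", "DT"), ("cells", "NNS"), ("divide", "VBP")],
    [("The", "DT"), ("proteins", "NNS"), ("bind", "VBP")],
]
tagger = TokenCentricSketch(train)
predicted = tagger.tag(["The", "kinases", "bind"])
```

Because the token probability depends on features of the token rather than on having seen the exact word, a sketch like this can still assign a tag to an out-of-vocabulary term such as "kinases", which is the situation that arises when the training corpus is non-biomedical and the target text is biomedical.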