The importance of the lexicon in tagging biological text

Authors:
Lawrence H. Smith; Thomas C. Rindflesch; W. John Wilbur
Affiliations:
National Center for Biotechnology Information, National Library of Medicine, Bethesda, MD, USA (e-mail: lsmith@ncbi.nlm.nih.gov, wilbur@ncbi.nlm.nih.gov); Lister Hill National Center for Biomedical Communications, National Library of Medicine, Bethesda, MD, USA (e-mail: tcr@nlm.nih.gov); National Center for Biotechnology Information, National Library of Medicine, Bethesda, MD, USA (e-mail: lsmith@ncbi.nlm.nih.gov, wilbur@ncbi.nlm.nih.gov)
Venue:
Natural Language Engineering
Year:
2006



Abstract

A part-of-speech tagger is a fundamental and indispensable tool in computational linguistics, typically employed at the critical early stages of processing. Although widely available taggers achieve high accuracy in very general domains, they do not perform nearly as well when applied to novel specialized domains, and this is especially true of biological text. We present a stochastic tagger that achieves over 97.44% accuracy on MEDLINE abstracts. A primary component of the tagger is its lexicon, which enumerates the permitted parts-of-speech for the 10,000 words most frequently occurring in MEDLINE. We present evidence for the conclusion that the lexicon is as vital to tagger accuracy as a training corpus, and more important than previously thought.
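
The role of a lexicon that enumerates permitted parts-of-speech can be illustrated with a minimal sketch. It assumes a bigram hidden Markov model decoded with the Viterbi algorithm, which is one common kind of stochastic tagger; the abstract does not state which model the authors use. The toy lexicon, the tag list, the probability tables, and the names `permitted_tags` and `viterbi` are all hypothetical and are not taken from the paper.

```python
import math

# Hypothetical toy lexicon: word -> set of permitted tags (not from the paper).
LEXICON = {
    "the": {"DT"},
    "protein": {"NN"},
    "binds": {"VBZ"},
    "binding": {"NN", "VBG", "JJ"},
}
ALL_TAGS = ["DT", "JJ", "NN", "VBG", "VBZ"]


def permitted_tags(word, lexicon=LEXICON, all_tags=ALL_TAGS):
    """Tags the lexicon allows for a word; unlisted words may take any tag."""
    return sorted(lexicon.get(word.lower(), all_tags))


def viterbi(words, trans, emit, lexicon=LEXICON, all_tags=ALL_TAGS, floor=1e-6):
    """Most probable tag sequence, searching only lexicon-permitted tags.

    trans maps (previous_tag, tag) -> probability, with "<s>" as the start tag.
    emit maps (tag, lowercased_word) -> probability.
    Missing entries fall back to a small floor probability (an assumption).
    """
    def log_prob(table, key):
        return math.log(table.get(key, floor))

    # best[tag] = (log-probability, tag path) of the best sequence ending in tag
    best = {"<s>": (0.0, [])}
    for word in words:
        w = word.lower()
        step = {}
        for tag in permitted_tags(w, lexicon, all_tags):
            score, path = max(
                (prev_score + log_prob(trans, (prev, tag)), prev_path)
                for prev, (prev_score, prev_path) in best.items()
            )
            step[tag] = (score + log_prob(emit, (tag, w)), path + [tag])
        best = step
    return max(best.values())[1]
```

In this sketch the lexicon acts as a hard constraint: a tag it does not list for a word is never considered, whatever the probability tables say. For a call such as `viterbi(["The", "binding", "protein"], trans, emit)`, the statistics only have to choose among the three tags listed for "binding", while "the" and "protein" are fixed by the lexicon alone.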