Domain-specific language models and lexicons for tagging

Authors:
Anni R. Coden;Serguei V. Pakhomov;Rie K. Ando;Patrick H. Duffy;Christopher G. Chute
Affiliations:
IBM, T.J. Watson Research Center, Hawthorne, NY;Division of Medical Informatics Research, Department of Health Sciences Research, Mayo Clinic, Rochester, MN;IBM, T.J. Watson Research Center, Hawthorne, NY;Division of Medical Informatics Research, Department of Health Sciences Research, Mayo Clinic, Rochester, MN;Division of Medical Informatics Research, Department of Health Sciences Research, Mayo Clinic, Rochester, MN
Venue:
Journal of Biomedical Informatics
Year:
2005

Abstract

Accurate and reliable part-of-speech tagging is useful for many Natural Language Processing (NLP) tasks that form the foundation of NLP-based approaches to information retrieval and data mining. In general, large annotated corpora are necessary to achieve the desired part-of-speech tagger accuracy. We show that a large annotated general-English corpus is not sufficient for building a part-of-speech tagger model adequate for tagging documents from the medical domain. However, adding a relatively small domain-specific corpus to a large general-English one raises accuracy from 87% to over 92% in our studies. We also suggest a number of characteristics for quantifying the similarity between a training corpus and the test data. These results give guidance for creating, at relatively small cost, a corpus suitable for building a part-of-speech tagger model that achieves satisfactory accuracy on a new domain.
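The abstract does not list the characteristics used to compare a training corpus with test data. As a rough illustration only, the sketch below computes one plausible measure of that kind, the unknown-word rate (the fraction of test tokens never seen in training). The function name `unknown_word_rate` and the toy sentences are assumptions made for this example and are not taken from the paper.

```python
def unknown_word_rate(train_tokens, test_tokens):
    """Fraction of test tokens whose lowercased form never occurs in training.

    Hypothetical helper: one possible way to quantify how well a training
    corpus covers the test data. Not necessarily the measure used in the paper.
    """
    if not test_tokens:
        return 0.0
    vocab = {t.lower() for t in train_tokens}
    unknown = sum(1 for t in test_tokens if t.lower() not in vocab)
    return unknown / len(test_tokens)


# Toy corpora, invented for illustration only.
general = "the patient man walked to the store and bought a book".split()
clinical = "patient denies chest pain and dyspnea on exertion".split()
test = "the patient denies dyspnea and chest pain".split()

rate_general = unknown_word_rate(general, test)
rate_combined = unknown_word_rate(general + clinical, test)

print(f"general only:       {rate_general:.2f}")   # 0.57
print(f"general + clinical: {rate_combined:.2f}")  # 0.00
```

In this toy setting, adding a handful of domain-specific tokens to the general corpus removes every unknown word from the test sentence, which mirrors in miniature the effect the abstract reports for a small domain-specific corpus.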