Domain-specific language models and lexicons for tagging

  • Authors:
  • Anni R. Coden; Serguei V. Pakhomov; Rie K. Ando; Patrick H. Duffy; Christopher G. Chute

  • Affiliations (in author order):
  • IBM, T.J. Watson Research Center, Hawthorne, NY; Division of Medical Informatics Research, Department of Health Sciences Research, Mayo Clinic, Rochester, MN; IBM, T.J. Watson Research Center, Hawthorne, NY; Division of Medical Informatics Research, Department of Health Sciences Research, Mayo Clinic, Rochester, MN; Division of Medical Informatics Research, Department of Health Sciences Research, Mayo Clinic, Rochester, MN

  • Venue:
  • Journal of Biomedical Informatics
  • Year:
  • 2005

Abstract

Accurate and reliable part-of-speech tagging is useful for many Natural Language Processing (NLP) tasks that form the foundation of NLP-based approaches to information retrieval and data mining. In general, large annotated corpora are needed to reach the desired tagger accuracy. We show that a large annotated general-English corpus is not sufficient for building a part-of-speech tagger model adequate for tagging documents from the medical domain. However, adding a small domain-specific corpus to the large general-English one raises accuracy from 87% to over 92% in our studies. We also suggest a number of characteristics that quantify the similarity between a training corpus and the test data. These results give guidance for creating, at relatively small cost, a corpus suitable for building a part-of-speech tagger model that achieves satisfactory accuracy on a new domain.
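
The abstract does not name the characteristics used to compare a training corpus with test data. As a rough illustration only, the sketch below computes one plausible characteristic, the share of test tokens never seen in training, and shows how it changes when a small domain corpus is added to a general one. The function name `unknown_token_rate` and the toy sentences are assumptions made for this example and are not taken from the paper.

```python
def unknown_token_rate(train_tokens, test_tokens):
    """Fraction of test tokens whose lowercase form never occurs in training.

    Hypothetical similarity measure: lower values suggest the training
    corpus covers the test vocabulary better.
    """
    if not test_tokens:
        return 0.0
    vocab = {t.lower() for t in train_tokens}
    unseen = sum(1 for t in test_tokens if t.lower() not in vocab)
    return unseen / len(test_tokens)


# Toy, made-up token lists standing in for real corpora.
general = "the patient was seen in the clinic today".split()
domain = "patient denies chest pain or dyspnea".split()
test = "the patient denies dyspnea and chest pain".split()

rate_general = unknown_token_rate(general, test)
rate_combined = unknown_token_rate(general + domain, test)

print(f"general only:     {rate_general:.2f}")
print(f"general + domain: {rate_combined:.2f}")
```

A measure like this only covers vocabulary overlap. It says nothing about whether known words appear with the same tags in both corpora, so it is at best one of several characteristics one might track.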