Journal of Biomedical Informatics
Accurate and reliable part-of-speech tagging is useful for many Natural Language Processing (NLP) tasks that form the foundation of NLP-based approaches to information retrieval and data mining. In general, large annotated corpora are needed to reach the desired tagging accuracy. We show that a large annotated general-English corpus is not sufficient for building a part-of-speech tagger model adequate for tagging documents from the medical domain. However, adding a relatively small domain-specific corpus to a large general-English one raises accuracy from 87% to over 92% in our studies. We also suggest a number of characteristics for quantifying the similarity between a training corpus and the test data. These results give guidance for creating an appropriate corpus for building a part-of-speech tagger model that achieves satisfactory accuracy on a new domain at relatively small cost.
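The abstract does not list the characteristics used to compare a training corpus with the test data. As an illustration only, the sketch below computes one plausible candidate, the share of test tokens that never occur in the training corpus. The function name `unknown_token_rate`, the lowercasing step, and the toy sentences are assumptions made for this example and are not taken from the paper.

```python
def unknown_token_rate(train_tokens, test_tokens):
    """Return the fraction of test tokens unseen in training.

    Hypothetical helper: tokens are compared case-insensitively,
    which is an assumption of this sketch, not the paper's method.
    """
    if not test_tokens:
        return 0.0
    vocab = {tok.lower() for tok in train_tokens}
    unseen = sum(1 for tok in test_tokens if tok.lower() not in vocab)
    return unseen / len(test_tokens)


# Made-up toy data standing in for a general-English corpus,
# a small domain-specific corpus, and domain test data.
general = "the patient walked to the store and bought some bread".split()
clinical = "patient denies chest pain and dyspnea".split()
test = "the patient denies dyspnea".split()

rate_general = unknown_token_rate(general, test)
rate_combined = unknown_token_rate(general + clinical, test)

print(f"general only:       {rate_general:.2f}")
print(f"general + clinical: {rate_combined:.2f}")
```

On this toy data the unseen-token share drops once the small domain corpus is added, which mirrors the direction of the effect the abstract reports without reproducing its accuracy figures.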