Subdomain adaptation of a POS tagger with a small corpus

Authors:
Yuka Tateisi;Yoshimasa Tsuruoka;Jun-ichi Tsujii
Affiliations:
Kogakuin University, Shinjuku-ku, Tokyo, Japan;University of Manchester, Manchester, U.K.;University of Tokyo, Tokyo, Japan
Venue:
BioNLP '06 Proceedings of the Workshop on Linking Natural Language Processing and Biology: Towards Deeper Biological Literature Analysis
Year:
2006

References:
Building a large annotated corpus of English: the Penn Treebank. Computational Linguistics - Special issue on using large corpora: II
Exploiting auxiliary distributions in stochastic unification-based grammars. NAACL 2000 Proceedings of the 1st North American chapter of the Association for Computational Linguistics conference
Feature-rich part-of-speech tagging with a cyclic dependency network. NAACL '03 Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology - Volume 1
Evaluation and extension of maximum entropy models with inequality constraints. EMNLP '03 Proceedings of the 2003 conference on Empirical methods in natural language processing
Developing a robust part-of-speech tagger for biomedical text. PCI'05 Proceedings of the 10th Panhellenic conference on Advances in Informatics
Parsing biomedical literature. IJCNLP'05 Proceedings of the Second international joint conference on Natural Language Processing
Adapting a probabilistic disambiguation model of an HPSG parser to a new domain. IJCNLP'05 Proceedings of the Second international joint conference on Natural Language Processing

Cited by:
A token centric part-of-speech tagger for biomedical text. AIME'11 Proceedings of the 13th conference on Artificial intelligence in medicine

Abstract

For the domain of biomedical research abstracts, two large corpora are available: GENIA (Kim et al. 2003) and Penn BioIE (Kulik et al. 2004). Both are drawn mainly from the human domain, and it is unknown how well systems trained on them perform when applied to abstracts dealing with other species. In machine-learning-based systems, retraining the model with additional corpora from the target domain has achieved promising results (e.g. Tsuruoka et al. 2005, Lease et al. 2005). In this paper, we compare two methods for adapting POS taggers trained on the GENIA and Penn BioIE corpora to the Drosophila melanogaster (fruit fly) domain.
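The general idea of retraining with added target-domain data can be illustrated with a minimal sketch. It is not either of the two methods compared in the paper, which the abstract does not name. A toy most-frequent-tag (unigram) tagger stands in for a machine-learning tagger, and the sentences, tags, and function names are hypothetical illustrations rather than material from GENIA, Penn BioIE, or any fly corpus.

```python
from collections import Counter, defaultdict


def train_unigram_tagger(sentences, default_tag="NN"):
    """Build a tagger that gives each word its most frequent training tag.

    Words never seen in training receive `default_tag`.
    """
    counts = defaultdict(Counter)
    for sentence in sentences:
        for word, tag in sentence:
            counts[word.lower()][tag] += 1
    best = {word: tags.most_common(1)[0][0] for word, tags in counts.items()}

    def tag(words):
        return [(w, best.get(w.lower(), default_tag)) for w in words]

    return tag


# Hypothetical source-domain data: "wingless" appears as an ordinary adjective.
source_corpus = [
    [("wingless", "JJ"), ("insects", "NNS"), ("express", "VBP"),
     ("the", "DT"), ("gene", "NN")],
]

# Hypothetical small target-domain data: "wingless" is a fly gene name.
target_corpus = [
    [("wingless", "NN"), ("signals", "VBZ"), ("in", "IN"),
     ("the", "DT"), ("embryo", "NN")],
    [("wingless", "NN"), ("activates", "VBZ"), ("the", "DT"), ("gene", "NN")],
]

# Baseline: trained on the source domain only.
baseline = train_unigram_tagger(source_corpus)

# Adapted: retrained on the source corpus plus the small target corpus.
adapted = train_unigram_tagger(source_corpus + target_corpus)

sentence = ["wingless", "signals"]
print(baseline(sentence))
print(adapted(sentence))
```

In this toy setting the baseline tags "wingless" as an adjective and falls back to the default tag for the unseen word "signals", while the retrained tagger picks up both target-domain usages. Real taggers use context features and far more data, so the effect of a small target corpus is less clear-cut than here.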