Influence of pre-annotation on POS-tagged corpus development

Authors:
Karën Fort; Benoît Sagot
Affiliations:
Nancy / Paris, France; Paris, France
Venue:
LAW IV '10 Proceedings of the Fourth Linguistic Annotation Workshop
Year:
2010

References (6):
Assessing agreement on classification tasks: the kappa statistic. Computational Linguistics.
Building a large annotated corpus of English: the Penn Treebank. Computational Linguistics - Special issue on using large corpora: II.
Inter-coder agreement for computational linguistics. Computational Linguistics.
Semi-supervised training for the averaged perceptron POS tagger. EACL '09 Proceedings of the 12th Conference of the European Chapter of the Association for Computational Linguistics.
Complex linguistic annotation - no easy way out!: a case from Bangla and Hindi POS labeling tasks. ACL-IJCNLP '09 Proceedings of the Third Linguistic Annotation Workshop.
Assessing the benefits of partial automatic pre-labeling for frame-semantic annotation. ACL-IJCNLP '09 Proceedings of the Third Linguistic Annotation Workshop.

Cited by (1):
A prototype tool set to support machine-assisted annotation. BioNLP '12 Proceedings of the 2012 Workshop on Biomedical Natural Language Processing.

Abstract

This article describes a series of carefully designed experiments that evaluate the influence of automatic pre-annotation on the manual part-of-speech annotation of a corpus, in terms of both quality and annotation time, with particular attention to biases. For this purpose, we manually annotated parts of the Penn Treebank corpus (Marcus et al., 1993) under various experimental setups, either from scratch or starting from various pre-annotations. These experiments confirm and detail the gain in quality observed in earlier work (Marcus et al., 1993; Dandapat et al., 2009; Rehbein et al., 2009), while showing that biases do appear and should be taken into account. Finally, they demonstrate that even a tagger of limited accuracy can help improve annotation speed.