(Semi-)automatic detection of errors in PoS-tagged corpora

Authors:
Pavel Květoň;Karel Oliva
Affiliations:
Austrian Research Institute for Artificial Intelligence (OeFAI), Wien, Austria;Austrian Research Institute for Artificial Intelligence (OeFAI), Wien, Austria
Venue:
COLING '02 Proceedings of the 19th international conference on Computational linguistics - Volume 1
Year:
2002

Abstract

This paper presents a simple yet, in practice, very efficient technique for automatically detecting positions in a part-of-speech tagged corpus where an error is to be suspected. The approach is based on learning and later applying "negative bigrams", i.e. on searching for pairs of adjacent tags that constitute an incorrect configuration in a text of a particular language (in English, e.g., the bigram ARTICLE - FINITE VERB). The paper further describes the generalization of "negative bigrams" to "negative n-grams", for any natural n, which provides a powerful tool for error detection in a corpus. The implementation is also discussed, together with an evaluation of the approach when used for error detection in the NEGRA® corpus of German and the general implications for the quality of results of statistical taggers. Illustrative examples in the text are taken from German, so at least a basic command of the language helps in understanding them; because of the complexity of the explanation that would be required, the examples are neither glossed nor translated. The central ideas of the paper should nevertheless be understandable without any knowledge of German.
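The negative-bigram idea can be illustrated with a minimal Python sketch. It is not the authors' learning procedure, which the abstract does not spell out. The sketch assumes, naively, that any tag n-gram never observed in a small trusted sample is "negative", and that n-grams are contiguous tag sequences. The function names (`learn_negative_ngrams`, `suspect_positions`) and the toy data are hypothetical; the tags ART, NN, VVFIN and PPER are used only as illustrative labels.

```python
from itertools import product
from typing import Iterable, Iterator, List, Sequence, Set, Tuple

Tags = Sequence[str]
NGram = Tuple[str, ...]


def ngrams(tags: Tags, n: int) -> Iterator[NGram]:
    """Yield every contiguous tag n-gram of one sentence, left to right."""
    for i in range(len(tags) - n + 1):
        yield tuple(tags[i:i + n])


def learn_negative_ngrams(trusted: Iterable[Tags], n: int = 2) -> Set[NGram]:
    """Return all n-grams over the observed tagset that never occur in `trusted`.

    Assumption of this sketch: absence from the trusted sample is taken as
    evidence that the configuration is incorrect. On real data this would
    need a much larger sample and manual review of the candidates.
    """
    sentences = [list(s) for s in trusted]
    tagset = {tag for s in sentences for tag in s}
    attested = {g for s in sentences for g in ngrams(s, n)}
    # The candidate space grows as len(tagset) ** n, so keep n small.
    return set(product(sorted(tagset), repeat=n)) - attested


def suspect_positions(sentence: Tags, negative: Set[NGram], n: int = 2) -> List[int]:
    """Return start indices of n-grams in `sentence` that match a negative n-gram."""
    return [i for i, g in enumerate(ngrams(sentence, n)) if g in negative]


# Toy "trusted" sample; the tag names are illustrative only.
trusted_sample = [
    ["ART", "NN", "VVFIN"],
    ["PPER", "VVFIN", "ART", "NN"],
]
negative_bigrams = learn_negative_ngrams(trusted_sample, n=2)

# ARTICLE followed by FINITE VERB is flagged at index 2.
print(suspect_positions(["PPER", "VVFIN", "ART", "VVFIN"], negative_bigrams))  # -> [2]
```

Passing a larger `n` to both functions gives a crude contiguous version of negative n-grams. The flagged indices are only positions where an error is to be suspected, so they still call for manual inspection.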