Correcting a PoS-tagged corpus using three complementary methods

Authors:
Hrafn Loftsson
Affiliations:
Reykjavik University, Reykjavik, Iceland
Venue:
EACL '09 Proceedings of the 12th Conference of the European Chapter of the Association for Computational Linguistics
Year:
2009

Citing 9

Achieving an Almost Correct PoS-Tagged Corpus
TSD '02 Proceedings of the 5th International Conference on Text, Speech and Dialogue

Building a large annotated corpus of English: the Penn Treebank
Computational Linguistics - Special issue on using large corpora: II

Improving accuracy in word class tagging through the combination of machine learning systems
Computational Linguistics

TnT: a statistical part-of-speech tagger
ANLC '00 Proceedings of the sixth conference on Applied natural language processing

Detecting errors in part-of-speech annotation
EACL '03 Proceedings of the tenth conference on European chapter of the Association for Computational Linguistics - Volume 1

Detecting errors in corpora using support vector machines
COLING '02 Proceedings of the 19th international conference on Computational linguistics - Volume 1

Transformation-based learning in the fast lane
NAACL '01 Proceedings of the second meeting of the North American Chapter of the Association for Computational Linguistics on Language technologies

Icelandic data driven part of speech tagging
HLT-Short '08 Proceedings of the 46th Annual Meeting of the Association for Computational Linguistics on Human Language Technologies: Short Papers

Representations for category disambiguation
COLING '08 Proceedings of the 22nd International Conference on Computational Linguistics - Volume 1

Cited 2

Detecting errors in automatically-parsed dependency relations
ACL '10 Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics

Consistency checking for Treebank alignment
LAW IV '10 Proceedings of the Fourth Linguistic Annotation Workshop

Abstract

The quality of the part-of-speech (PoS) annotation in a corpus is crucial for the development of PoS taggers. In this paper, we experiment with three complementary methods for automatically detecting errors in the PoS annotation of the Icelandic Frequency Dictionary corpus. The first two methods are language-independent, and we argue that the third method can be adapted to other morphologically complex languages. Once possible errors have been detected, we examine each error candidate and hand-correct the corresponding PoS tag if necessary. Overall, based on the three methods, we hand-correct the PoS tagging of 1,334 tokens (0.23% of the tokens) in the corpus. Furthermore, we re-evaluate existing state-of-the-art PoS taggers on Icelandic text using the corrected corpus.
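
The abstract does not say what the three detection methods are, so the following is only a generic illustration of the "detect candidates, then inspect by hand" workflow it describes. It sketches one simple language-independent heuristic, which is an assumption and not necessarily one of the paper's methods: a token becomes an error candidate when several taggers all agree on a tag that differs from the tag in the corpus. The function name `error_candidates`, the tag labels, and the toy data are all hypothetical.

```python
from typing import List, Sequence


def error_candidates(gold: Sequence[str],
                     predictions: Sequence[Sequence[str]]) -> List[int]:
    """Return token indices where every tagger proposes the same tag
    and that tag differs from the tag stored in the corpus.

    Assumption: this unanimous-disagreement rule is an illustrative
    heuristic, not the procedure used in the paper.
    """
    candidates = []
    for i, gold_tag in enumerate(gold):
        # Collect the distinct tags the taggers proposed for token i.
        votes = {tags[i] for tags in predictions}
        # Flag only when the taggers are unanimous and contradict the corpus.
        if len(votes) == 1 and gold_tag not in votes:
            candidates.append(i)
    return candidates


# Toy data, invented for illustration only.
gold_tags = ["DET", "NOUN", "NOUN", "ADV"]
tagger_outputs = [
    ["DET", "NOUN", "VERB", "ADV"],
    ["DET", "NOUN", "VERB", "ADJ"],
    ["DET", "NOUN", "VERB", "ADV"],
]

flagged = error_candidates(gold_tags, tagger_outputs)
print(flagged)  # indices a human annotator would then review
```

Flagged indices are candidates only. As in the workflow the abstract describes, each one would still be examined by hand, and the corpus tag changed only if it is actually wrong.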