The quality of the part-of-speech (PoS) annotation in a corpus is crucial for the development of PoS taggers. In this paper, we experiment with three complementary methods for automatically detecting errors in the PoS annotation of the Icelandic Frequency Dictionary corpus. The first two methods are language-independent, and we argue that the third can be adapted to other morphologically complex languages. Once possible errors have been detected, we examine each error candidate and hand-correct the corresponding PoS tag where necessary. Overall, based on the three methods, we hand-correct the PoS tags of 1,334 tokens (0.23% of the tokens) in the corpus. Furthermore, we re-evaluate existing state-of-the-art PoS taggers on Icelandic text using the corrected corpus.