Automatic detection and correction of errors in dependency treebanks

Authors:
Alexander Volokh; Günter Neumann
Affiliations:
DFKI, Stuhlsatzenhausweg, Saarbrücken, Germany; DFKI, Stuhlsatzenhausweg, Saarbrücken, Germany
Venue:
HLT '11 Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies: short papers - Volume 2
Year:
2011

Abstract

Annotated corpora are essential for almost all NLP applications. Although they are expected to be of very high quality because of their importance for follow-up developments, they still contain a considerable number of errors. With this work we want to draw attention to this fact. Additionally, we try to estimate the number of errors and propose a method for their automatic correction. Although our approach finds only a portion of the errors that we suppose are contained in almost any annotated corpus due to the nature of its creation process, it has very high precision and is thus in any case beneficial for the quality of the corpus it is applied to. Finally, we compare it to a different method for error detection in treebanks and find that the errors the two methods detect are mostly different and that the approaches are complementary.