Divide and conquer: crowdsourcing the creation of cross-lingual textual entailment corpora

  • Authors: Matteo Negri; Luisa Bentivogli; Yashar Mehdad; Danilo Giampiccolo; Alessandro Marchetti

  • Affiliations: FBK-irst, Trento, Italy; FBK-irst, Trento, Italy; FBK-irst and University of Trento, Trento, Italy; CELCT, Trento, Italy; CELCT, Trento, Italy

  • Venue: EMNLP '11 Proceedings of the Conference on Empirical Methods in Natural Language Processing
  • Year: 2011

Abstract

We address the creation of cross-lingual textual entailment corpora by means of crowdsourcing. Our goal is to define a cheap and replicable data collection methodology that minimizes the manual work done by expert annotators, without resorting to preprocessing tools or already annotated monolingual datasets. In line with recent work emphasizing the need for large-scale annotation efforts for textual entailment, our work aims to: i) tackle the scarcity of data available to train and evaluate systems, and ii) promote crowdsourcing as an effective way to reduce the costs of data collection without sacrificing quality. We show that a complex data creation task, on which even experts usually reach low agreement scores, can be effectively decomposed into simple subtasks assigned to non-expert annotators. The resulting dataset, obtained from a pipeline of different jobs routed to Amazon Mechanical Turk, contains more than 1,600 aligned pairs for each combination of texts and hypotheses in English, Italian and German.