Divide and conquer: crowdsourcing the creation of cross-lingual textual entailment corpora

Authors:
Matteo Negri;Luisa Bentivogli;Yashar Mehdad;Danilo Giampiccolo;Alessandro Marchetti
Affiliations:
FBK-irst, Trento, Italy;FBK-irst, Trento, Italy;FBK-irst and University of Trento, Trento, Italy;CELCT, Trento, Italy;CELCT, Trento, Italy
Venue:
EMNLP '11 Proceedings of the Conference on Empirical Methods in Natural Language Processing
Year:
2011

Abstract

We address the creation of cross-lingual textual entailment corpora by means of crowdsourcing. Our goal is to define a cheap and replicable data collection methodology that minimizes the manual work done by expert annotators, without resorting to preprocessing tools or already annotated monolingual datasets. In line with recent work emphasizing the need for large-scale annotation efforts for textual entailment, we aim to: i) tackle the scarcity of data available to train and evaluate systems, and ii) promote crowdsourcing as an effective way to reduce the costs of data collection without sacrificing quality. We show that a complex data creation task, for which even experts usually show low agreement scores, can be effectively decomposed into simple subtasks assigned to non-expert annotators. The resulting dataset, obtained from a pipeline of different jobs routed to Amazon Mechanical Turk, contains more than 1,600 aligned pairs for each combination of texts and hypotheses in English, Italian and German.