Paraphrase acquisition via crowdsourcing and machine learning

Authors:
Steven Burrows; Martin Potthast; Benno Stein
Affiliations:
Bauhaus-Universität Weimar, Germany; Bauhaus-Universität Weimar, Germany; Bauhaus-Universität Weimar, Germany
Venue:
ACM Transactions on Intelligent Systems and Technology (TIST) - Special Sections on Paraphrasing; Intelligent Systems for Socially Aware Computing; Social Computing, Behavioral-Cultural Modeling, and Prediction
Year:
2013

Abstract

To paraphrase means to rewrite content while preserving the original meaning. Paraphrasing is important in fields such as text reuse in journalism, anonymizing work, and improving the quality of customer-written reviews. This article contributes to paraphrase acquisition and focuses on two aspects that are not addressed by current research: (1) acquisition via crowdsourcing, and (2) acquisition of passage-level samples. The challenge of the first aspect is automatic quality assurance; without such a means the crowdsourcing paradigm is not effective, and without crowdsourcing the creation of test corpora is unacceptably expensive for realistic orders of magnitude. The second aspect addresses the deficit that most of the previous work in generating and evaluating paraphrases has been conducted using sentence-level paraphrases or shorter; these short-sample analyses are limited in terms of application to plagiarism detection, for example. We present the Webis Crowd Paraphrase Corpus 2011 (Webis-CPC-11), which recently formed part of the PAN 2010 international plagiarism detection competition. This corpus, crowdsourced via Amazon's Mechanical Turk, comprises passage-level paraphrases with 4067 positive samples and 3792 negative samples that failed our criteria. In this article, we review the lessons learned at PAN 2010, and explain in detail the method used to construct the corpus. The empirical contributions include machine learning experiments to explore whether passage-level paraphrases can be identified in a two-class classification problem using paraphrase similarity features. We find that a k-nearest-neighbor classifier can correctly distinguish between paraphrased and nonparaphrased samples with 0.980 precision at 0.523 recall.
This result implies that just under half of our samples must be discarded (the remaining fraction of 0.477), but our cost analysis shows that the automation we introduce results in an 18% financial saving and over 100 hours of time returned to the researchers when repeating a similar corpus design. On the other hand, when building an unrelated corpus requiring, say, 25% training data for the automated component, we show that the financial outcome is cost neutral, while still returning over 70 hours of time to the researchers. The work presented here is the first to join the paraphrasing and plagiarism communities.
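
The relationship between the reported recall and the discarded fraction can be illustrated with a minimal sketch. The confusion counts below are hypothetical, chosen only so that the standard definitions of precision and recall reproduce the figures quoted in the abstract; they are not counts reported by the authors, and the function names are illustrative.

```python
def precision(tp: int, fp: int) -> float:
    """Fraction of samples accepted as paraphrases that really are paraphrases."""
    return tp / (tp + fp)


def recall(tp: int, fn: int) -> float:
    """Fraction of true paraphrases that the classifier accepts."""
    return tp / (tp + fn)


def discarded_fraction(recall_value: float) -> float:
    """Fraction of true paraphrases that the classifier does not accept."""
    return 1.0 - recall_value


# Hypothetical counts (assumption, not from the paper): out of 2000 true
# paraphrases, 1046 are accepted and 954 are rejected; 21 non-paraphrases
# are wrongly accepted.
tp, fp, fn = 1046, 21, 954

p = precision(tp, fp)
r = recall(tp, fn)
print(f"precision = {p:.3f}")
print(f"recall    = {r:.3f}")
print(f"discarded = {discarded_fraction(r):.3f}")
```

Under these assumed counts, high precision means that almost everything the classifier accepts is a genuine paraphrase, while the discarded fraction is simply one minus the recall.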