Mavuno: a scalable and effective Hadoop-based paraphrase acquisition system

Authors:
Donald Metzler;Eduard Hovy
Affiliations:
University of Southern California, Marina del Rey, CA;University of Southern California, Marina del Rey, CA
Venue:
Proceedings of the Third Workshop on Large Scale Data Mining: Theory and Applications
Year:
2011

Citing 17

Discovery of inference rules for question-answering
Natural Language Engineering

Extracting paraphrases from a parallel corpus
ACL '01 Proceedings of the 39th Annual Meeting on Association for Computational Linguistics

Learning to paraphrase: an unsupervised approach using multiple-sequence alignment
NAACL '03 Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology - Volume 1

Syntax-based alignment of multiple translations: extracting paraphrases and generating new sentences
NAACL '03 Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology - Volume 1

Paraphrasing with bilingual parallel corpora
ACL '05 Proceedings of the 43rd Annual Meeting on Association for Computational Linguistics

Scaling up all pairs similarity search
Proceedings of the 16th international conference on World Wide Web

MapReduce: simplified data processing on large clusters
OSDI'04 Proceedings of the 6th conference on Symposium on Operating Systems Design & Implementation - Volume 6

Pairwise document similarity in large collections with MapReduce
HLT-Short '08 Proceedings of the 46th Annual Meeting of the Association for Computational Linguistics on Human Language Technologies: Short Papers

Syntactic constraints on paraphrases extracted from parallel corpora
EMNLP '08 Proceedings of the Conference on Empirical Methods in Natural Language Processing

Learning by reading: a prototype system, performance baseline and lessons learned
AAAI'07 Proceedings of the 22nd national conference on Artificial intelligence - Volume 1

Web-scale distributional similarity and entity set expansion
EMNLP '09 Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing: Volume 2

Hitting the right paraphrases in good time
HLT '10 Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics

A survey of paraphrasing and textual entailment methods
Journal of Artificial Intelligence Research

Generating phrasal and sentential paraphrases: A survey of data-driven methods
Computational Linguistics

An empirical evaluation of data-driven paraphrase generation techniques
HLT '11 Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies: short papers - Volume 2

Paraphrase identification on the basis of supervised machine learning techniques
FinTAL'06 Proceedings of the 5th international conference on Advances in Natural Language Processing

Aligning needles in a haystack: paraphrase acquisition across the web
IJCNLP'05 Proceedings of the Second international joint conference on Natural Language Processing

Cited 1

Diversity-aware evaluation for paraphrase patterns
TIWTE '11 Proceedings of the TextInfer 2011 Workshop on Textual Entailment

Abstract

Paraphrase acquisition is an important natural language processing (NLP) task that has received a great deal of interest recently. Proposed solutions range from simple approaches that make minimal use of NLP tools to more complex approaches that rely heavily on numerous language-dependent resources. Despite all of this work, no publicly available toolkit supports large-scale paraphrase mining research. Nor has there been a direct empirical evaluation comparing the merits of simple, scalable approaches with those of approaches that make extensive use of expensive NLP resources. This paper introduces Mavuno, a Hadoop-based paraphrase acquisition toolkit that is both scalable and robust. Within the context of Mavuno, we empirically examine the tradeoffs between simple and sophisticated paraphrase acquisition approaches to help shed light on their strengths and weaknesses. Our evaluation reveals that simple approaches have many advantages, including strong effectiveness, good coverage, low redundancy, and the ability to handle noisy data.