Mavuno: a scalable and effective Hadoop-based paraphrase acquisition system

  • Authors:
  • Donald Metzler; Eduard Hovy

  • Affiliations:
  • University of Southern California, Marina del Rey, CA; University of Southern California, Marina del Rey, CA

  • Venue:
  • Proceedings of the Third Workshop on Large Scale Data Mining: Theory and Applications
  • Year:
  • 2011

Abstract

Paraphrase acquisition is an important natural language processing (NLP) task that has received a great deal of interest recently. Proposed solutions to the problem have ranged from simple approaches that make minimal use of NLP tools to more complex approaches that rely heavily on numerous language-dependent resources. Despite all of this work, there are no publicly available toolkits to support large-scale paraphrase mining research. There has also never been a direct empirical evaluation comparing the merits of simple, scalable approaches and those that make extensive use of expensive NLP resources. This paper introduces Mavuno, a Hadoop-based paraphrase acquisition toolkit that is both scalable and robust. Within the context of Mavuno, we empirically examine the tradeoffs between simple and sophisticated paraphrase acquisition approaches to help shed light on their strengths and weaknesses. Our evaluation reveals that simple approaches have many advantages, including strong effectiveness, good coverage, low redundancy, and the ability to handle noisy data.