Aligning needles in a haystack: paraphrase acquisition across the web

Authors:
Marius Paşca;Péter Dienes
Affiliations:
Google Inc., Mountain View, California;Google Inc., Mountain View, California
Venue:
IJCNLP'05 Proceedings of the Second international joint conference on Natural Language Processing
Year:
2005

Citing 12
Cited 19

Improving automatic query expansion

Proceedings of the 21st annual international ACM SIGIR conference on Research and development in information retrieval
Building a question answering test collection

SIGIR '00 Proceedings of the 23rd annual international ACM SIGIR conference on Research and development in information retrieval
TnT: a statistical part-of-speech tagger

ANLC '00 Proceedings of the sixth conference on Applied natural language processing
Expansion of multi-word terms for indexing and retrieval using morphology and syntax

ACL '98 Proceedings of the 35th Annual Meeting of the Association for Computational Linguistics and Eighth Conference of the European Chapter of the Association for Computational Linguistics
Lexical query paraphrasing for document retrieval

COLING '02 Proceedings of the 19th international conference on Computational linguistics - Volume 1
Extracting paraphrases from a parallel corpus

ACL '01 Proceedings of the 39th Annual Meeting on Association for Computational Linguistics
Learning to paraphrase: an unsupervised approach using multiple-sequence alignment

NAACL '03 Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology - Volume 1
Paraphrase acquisition for information extraction

PARAPHRASE '03 Proceedings of the second international workshop on Paraphrasing - Volume 16
Unsupervised construction of large paraphrase corpora: exploiting massively parallel news sources

COLING '04 Proceedings of the 20th international conference on Computational Linguistics
Corpus and evaluation measures for multiple document summarization with multiple sources

COLING '04 Proceedings of the 20th international conference on Computational Linguistics
MapReduce: simplified data processing on large clusters

OSDI'04 Proceedings of the 6th conference on Symposium on Opearting Systems Design & Implementation - Volume 6
Automatic paraphrase acquisition from news articles

HLT '02 Proceedings of the second international conference on Human Language Technology Research

Extracting lay paraphrases of specialized expressions from monolingual comparable medical corpora

BUCC '09 Proceedings of the 2nd Workshop on Building and Using Comparable Corpora: from Parallel to Non-parallel Corpora
Gram-free synonym extraction via suffix arrays

AIRS'08 Proceedings of the 4th Asia information retrieval conference on Information retrieval technology
Paraphrasing with search engine query logs

COLING '10 Proceedings of the 23rd International Conference on Computational Linguistics
A survey of paraphrasing and textual entailment methods

Journal of Artificial Intelligence Research
Generating phrasal and sentential paraphrases: A survey of data-driven methods

Computational Linguistics
SPIDER: a system for paraphrasing in document editing and revision-applicability in machine translation pre-editing

CICLing'11 Proceedings of the 12th international conference on Computational linguistics and intelligent text processing - Volume Part II
An empirical evaluation of data-driven paraphrase generation techniques

HLT '11 Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies: short papers - Volume 2
Mavuno: a scalable and effective Hadoop-based paraphrase acquisition system

Proceedings of the Third Workshop on Large Scale Data Mining: Theory and Applications
Paraphrase identification on the basis of supervised machine learning techniques

FinTAL'06 Proceedings of the 5th international conference on Advances in Natural Language Processing
Unsupervised mining of lexical variants from noisy text

EMNLP '11 Proceedings of the First Workshop on Unsupervised Learning in NLP
Diversity-aware evaluation for paraphrase patterns

TIWTE '11 Proceedings of the TextInfer 2011 Workshop on Textual Entailment
Reranking bilingually extracted paraphrases using monolingual distributional similarity

GEMS '11 Proceedings of the GEMS 2011 Workshop on GEometrical Models of Natural Language Semantics
ATLAS - human language technologies integrated within a multilingual web content management system

EACL 2012 Proceedings of the Joint Workshop on Exploiting Synergies between Information Retrieval and Machine Translation (ESIRMT) and Hybrid Approaches to Machine Translation (HyTra)
Automatically mining question reformulation patterns from search log data

ACL '12 Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics: Short Papers - Volume 2
Enlarging paraphrase collections through generalization and instantiation

EMNLP-CoNLL '12 Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning
Generalizing sub-sentential paraphrase acquisition across original signal type of text pairs

EMNLP-CoNLL '12 Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning
Ensemble semantics for large-scale unsupervised relation extraction

EMNLP-CoNLL '12 Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning
Distributional phrasal paraphrase generation for statistical machine translation

ACM Transactions on Intelligent Systems and Technology (TIST) - Special Sections on Paraphrasing; Intelligent Systems for Socially Aware Computing; Social Computing, Behavioral-Cultural Modeling, and Prediction
Multitechnique paraphrase alignment: A contribution to pinpointing sub-sentential paraphrases

ACM Transactions on Intelligent Systems and Technology (TIST) - Special Sections on Paraphrasing; Intelligent Systems for Socially Aware Computing; Social Computing, Behavioral-Cultural Modeling, and Prediction

Quantified Score

Hi-index	0.01

Visualization

Abstract

This paper presents a lightweight method for unsupervised extraction of paraphrases from arbitrary textual Web documents. The method differs from previous approaches to paraphrase acquisition in that 1) it removes the assumptions on the quality of the input data, by using inherently noisy, unreliable Web documents rather than clean, trustworthy, properly formatted documents; and 2) it does not require any explicit clue indicating which documents are likely to encode parallel paraphrases, as they report on the same events or describe the same stories. Large sets of paraphrases are collected through exhaustive pairwise alignment of small needles, i.e., sentence fragments, across a haystack of Web document sentences. The paper describes experiments on a set of about one billion Web documents, and evaluates the extracted paraphrases in a natural-language Web search application.