WordNet: a lexical database for English
Communications of the ACM
Building a question answering test collection
SIGIR '00 Proceedings of the 23rd annual international ACM SIGIR conference on Research and development in information retrieval
Mining the Web for Synonyms: PMI-IR versus LSA on TOEFL
EMCL '01 Proceedings of the 12th European Conference on Machine Learning
Head-driven statistical models for natural language parsing
Head-driven statistical models for natural language parsing
TnT: a statistical part-of-speech tagger
ANLC '00 Proceedings of the sixth conference on Applied natural language processing
Expansion of multi-word terms for indexing and retrieval using morphology and syntax
ACL '98 Proceedings of the 35th Annual Meeting of the Association for Computational Linguistics and Eighth Conference of the European Chapter of the Association for Computational Linguistics
Acquisition of categorized named entities for web search
Proceedings of the thirteenth ACM international conference on Information and knowledge management
Extracting paraphrases from a parallel corpus
ACL '01 Proceedings of the 39th Annual Meeting on Association for Computational Linguistics
Learning to paraphrase: an unsupervised approach using multiple-sequence alignment
NAACL '03 Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology - Volume 1
Automatic labeling of semantic roles
ACL '00 Proceedings of the 38th Annual Meeting on Association for Computational Linguistics
Paraphrase acquisition for information extraction
PARAPHRASE '03 Proceedings of the second international workshop on Paraphrasing - Volume 16
Unsupervised construction of large paraphrase corpora: exploiting massively parallel news sources
COLING '04 Proceedings of the 20th international conference on Computational Linguistics
Corpus and evaluation measures for multiple document summarization with multiple sources
COLING '04 Proceedings of the 20th international conference on Computational Linguistics
MapReduce: simplified data processing on large clusters
OSDI'04 Proceedings of the 6th conference on Symposium on Opearting Systems Design & Implementation - Volume 6
Automatic paraphrase acquisition from news articles
HLT '02 Proceedings of the second international conference on Human Language Technology Research
Text relatedness based on a word thesaurus
Journal of Artificial Intelligence Research
Hi-index | 0.00 |
Near-synonyms or paraphrases are beneficial in a variety of natural language and information retrieval applications, but so far their acquisition has been confined to clean, trustworthy collections of documents with explicit external attributes. When such attributes are available, such as similar time stamps associated to a pair of news articles, previous approaches rely on them as signals of potentially high content overlap between the articles, often embodied in sentences that are only slight, paraphrase-based variations of each other. This paper introduces a new unsupervised method for extracting paraphrases from an information source of completely different nature and scale, namely unstructured text across arbitrary Web textual documents. In this case, no useful external attributes are consistently available for all documents. Instead, the paper introduces linguistically-motivated text anchors, which are identified automatically within the documents. The anchors are instrumental in the derivation of paraphrases through lightweight pairwise alignment of Web sentence fragments. A large set of categorized names, acquired separately from Web documents, serves as a filtering mechanism for improving the quality of the paraphrases. A set of paraphrases extracted from about a billion Web documents is evaluated both manually and through its impact on a natural-language Web search application.