Web-scale distributional similarity and entity set expansion

  • Authors:
  • Patrick Pantel;Eric Crestan;Arkady Borkovsky;Ana-Maria Popescu;Vishnu Vyas

  • Affiliations:
  • Yahoo! Labs, Sunnyvale, CA;Yahoo! Labs, Sunnyvale, CA;Yandex Labs, Burlingame, CA;Yahoo! Labs, Sunnyvale, CA;Yahoo! Labs, Sunnyvale, CA

  • Venue:
  • EMNLP '09 Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing: Volume 2 - Volume 2
  • Year:
  • 2009

Quantified Score

Hi-index 0.00

Visualization

Abstract

Computing the pairwise semantic similarity between all words on the Web is a computationally challenging task. Parallelization and optimizations are necessary. We propose a highly scalable implementation based on distributional similarity, implemented in the MapReduce framework and deployed over a 200 billion word crawl of the Web. The pairwise similarity between 500 million terms is computed in 50 hours using 200 quad-core nodes. We apply the learned similarity matrix to the task of automatic set expansion and present a large empirical study to quantify the effect on expansion performance of corpus size, corpus quality, seed composition and seed size. We make public an experimental testbed for set expansion analysis that includes a large collection of diverse entity sets extracted from Wikipedia.