Automatic retrieval of similar content using search engine query interface

  • Authors:
  • Ali Dasdan;Paolo D'Alberto;Santanu Kolay;Chris Drome

  • Affiliations:
  • Yahoo! Inc., Sunnyvale, CA, USA;Yahoo! Inc., Sunnyvale, CA, USA;Yahoo! Inc., Sunnyvale, CA, USA;Yahoo! Inc., Sunnyvale, CA, USA

  • Venue:
  • Proceedings of the 18th ACM conference on Information and knowledge management
  • Year:
  • 2009

Abstract

We consider the coverage testing problem: given a document and a corpus that exposes only a limited query interface, determine whether the corpus contains a near-duplicate of the document. This problem arises in competitive coverage testing for search engines. To solve it, we propose approaches that work in three main steps: generate a query signature from the document; query the corpus with that signature and scrape the returned results; and validate the similarity between the input document and the returned results. We discuss techniques to control and bound the performance of these methods. We perform large-scale experimental validation and show that the methods perform well across different search engine corpora and on documents in multiple languages. They are also robust to variations in the performance parameters.
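
The three steps named in the abstract can be illustrated with a minimal sketch. The abstract does not say how signatures are built or how similarity is validated, so everything specific here is an assumption made for illustration: the signature is the k least frequent terms of the document (longer terms first), validation uses Jaccard similarity over word 3-shingles with a hypothetical threshold of 0.8, and `search` stands in for whatever query interface the corpus exposes, assumed to return the text of each result. None of these names or choices come from the paper.

```python
import re
from collections import Counter
from typing import Callable, Iterable, List, Optional, Set, Tuple


def tokenize(text: str) -> List[str]:
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())


def query_signature(doc: str, k: int = 5) -> List[str]:
    """Step 1 (assumed heuristic): pick the k rarest terms in the document,
    preferring longer terms, with alphabetical order as a tie-breaker."""
    counts = Counter(tokenize(doc))
    ranked = sorted(counts, key=lambda t: (counts[t], -len(t), t))
    return ranked[:k]


def shingles(text: str, w: int = 3) -> Set[Tuple[str, ...]]:
    """Return the set of w-word shingles of the text."""
    toks = tokenize(text)
    if len(toks) < w:
        return {tuple(toks)} if toks else set()
    return {tuple(toks[i:i + w]) for i in range(len(toks) - w + 1)}


def jaccard(a: Set, b: Set) -> float:
    """Jaccard similarity of two sets; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def find_near_duplicate(
    doc: str,
    search: Callable[[str], Iterable[str]],  # hypothetical query interface
    threshold: float = 0.8,                  # assumed similarity cutoff
    k: int = 5,
) -> Optional[str]:
    """Run the three steps and return the first result that passes validation."""
    query = " ".join(query_signature(doc, k))   # step 1: build the signature
    doc_shingles = shingles(doc)
    for result_text in search(query):           # step 2: query the corpus
        # Step 3: validate similarity between the document and the result.
        if jaccard(doc_shingles, shingles(result_text)) >= threshold:
            return result_text
    return None
```

In this sketch `find_near_duplicate` returns `None` when no returned result reaches the threshold, which corresponds to reporting that the corpus does not appear to cover the document. A real deployment would replace `search` with a wrapper that issues the query to a search engine and scrapes the result pages.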