Hypergeometric language model and Zipf-like scoring function for web document similarity retrieval

  • Authors:
  • Felipe Bravo-Marquez;Gaston L'Huillier;Sebastián A. Ríos;Juan D. Velásquez

  • Affiliations:
  • University of Chile, Department of Industrial Engineering, Santiago, Chile (all authors)

  • Venue:
  • SPIRE'10 Proceedings of the 17th international conference on String processing and information retrieval
  • Year:
  • 2010

Abstract

Retrieving documents from the Web that are similar to a given document differs in many respects from information retrieval based on queries written by regular search engine users. This work proposes a new method for Web document similarity retrieval based on generative language models and meta search engines. Probabilistic language models serve as random query generators for the given document. The generated queries are submitted to a customizable set of Web search engines. Once all results are gathered, they are evaluated with a proposed scoring function based on Zipf's law. The results show that the proposed query generation methodology and scoring procedure solve the problem with acceptable levels of precision.
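
The abstract gives no formulas, so the following is only a minimal illustrative sketch of the two ideas it names, not the authors' implementation. It assumes that (a) a query is built by drawing token occurrences from the document without replacement, so that term counts in the sample follow a multivariate hypergeometric distribution, and (b) a "Zipf-like" score rewards a result at rank r with 1/r^s, summed over all result lists in which it appears. The function names `generate_query` and `zipf_score`, the exponent `s`, and the input shapes are all hypothetical choices made for this example.

```python
import random


def generate_query(tokens, length, rng):
    """Draw `length` token occurrences without replacement and return
    the distinct terms in draw order.

    Assumption: sampling occurrences without replacement stands in for
    the hypergeometric language model named in the title.
    """
    k = min(length, len(tokens))
    drawn = rng.sample(tokens, k)
    # dict.fromkeys drops repeated terms while preserving draw order.
    return list(dict.fromkeys(drawn))


def zipf_score(result_lists, s=1.0):
    """Aggregate ranked result lists into one ranking.

    Assumption: each URL earns 1 / rank**s per list it appears in,
    so top-ranked and frequently returned URLs score highest.
    """
    scores = {}
    for results in result_lists:
        for rank, url in enumerate(results, start=1):
            scores[url] = scores.get(url, 0.0) + 1.0 / rank ** s
    # Highest aggregated score first.
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)
```

A caller would tokenize the input document, call `generate_query` several times with a seeded `random.Random`, send each query to its chosen search engines, and pass the returned URL lists to `zipf_score`. For instance, `zipf_score([["a", "b", "c"], ["a", "c"]])` ranks `a` first, then `c`, then `b`, because `a` is at rank 1 in both lists.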