A solution to the exact match on rare item searches: introducing the lost sheep algorithm

  • Authors:
  • Morten Goodwin

  • Affiliations:
  • Tingtun AS, Lillesand, Norway

  • Venue:
  • Proceedings of the International Conference on Web Intelligence, Mining and Semantics
  • Year:
  • 2011

Quantified Score

Hi-index 0.00

Visualization

Abstract

This paper proposes an approach for finding a single web page in a large web site or a cloud of web pages. We formalize this problem and map it to the exact match on rare item searches (EMRIS). The EMRIS is not much addressed in the literature, but many closely related problems exists. This paper presents a state-of-the-art survey on related problems in the fields of information retrieval, web page classification and directed search. As a solution to the EMRIS, this paper presents an innovative algorithm called the lost sheep. The lost sheep is specifically designed to work in web sites with of links, link texts and web pages. It works as a pre-classifier on link texts to decide if a web page is candidate for further evaluation. This paper also defines sound metrics to evaluated the EMRIS. The lost sheep outperforms all comparable algorithms both when it comes to maximizing accuracy and minimizing the number of downloaded pages.