Distributed web search efficiency by truncating results

  • Authors:
  • Christopher T. Fallen;Gregory B. Newby

  • Affiliations:
  • University of Alaska Fairbanks, Fairbanks, AK;University of Alaska Fairbanks, Fairbanks, AK

  • Venue:
  • Proceedings of the 7th ACM/IEEE-CS joint conference on Digital libraries
  • Year:
  • 2007

Quantified Score

Hi-index 0.00

Visualization

Abstract

A large set of Web documents (the TREC GOV2 collection) comes from many separate Internet hosts, such as www.nih.gov and travel.state.gov. There is considerable variability in the number of Web pages (i.e., documents) from each host. In this paper, we present and evaluate a method for setting a maximum number of "hits" that may be presented for each web host. Federated search environments are increasingly common components of digital libraries and in these environments, the benefit of such a maximum is that it can reduce the number of possibly relevant documents presented by each subcollection, without hurting early precision measures such as P@20. Derivation of a maximum number, which is proportional to the subcollection size but not sensitive to different search topics, is made possible by an analysis of patterns of relevance judgment across approximately 17,000 web hosts in GOV2.