Distributed web search efficiency by truncating results

Authors:
Christopher T. Fallen;Gregory B. Newby
Affiliations:
University of Alaska Fairbanks, Fairbanks, AK;University of Alaska Fairbanks, Fairbanks, AK
Venue:
Proceedings of the 7th ACM/IEEE-CS joint conference on Digital libraries
Year:
2007

Abstract

The TREC GOV2 collection, a large set of Web documents, is drawn from many separate Internet hosts, such as www.nih.gov and travel.state.gov. The number of Web pages (i.e., documents) from each host varies considerably. In this paper, we present and evaluate a method for setting a maximum number of "hits" that may be presented for each Web host. Federated search environments are increasingly common components of digital libraries. In these environments, the benefit of such a maximum is that it reduces the number of possibly relevant documents presented by each subcollection without hurting early precision measures such as P@20. The maximum is proportional to the subcollection size but not sensitive to the search topic; its derivation is made possible by an analysis of patterns of relevance judgment across approximately 17,000 Web hosts in GOV2.
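The truncation idea described above can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract states only that the per-host maximum is proportional to subcollection size and independent of the topic, so the proportionality constant (`docs_per_hit`), the lower bound (`min_hits`), and all function names here are hypothetical choices made for illustration.

```python
from collections import defaultdict


def host_cap(host_size, docs_per_hit=1000, min_hits=1):
    """Maximum hits allowed for a host, proportional to its document count.

    docs_per_hit and min_hits are assumed parameters, not values from the paper.
    """
    return max(min_hits, host_size // docs_per_hit)


def truncate_by_host(ranked_hits, host_sizes, docs_per_hit=1000, min_hits=1):
    """Keep hits in rank order, dropping any hit beyond its host's cap.

    ranked_hits: list of (host, doc_id) pairs, best first.
    host_sizes:  dict mapping host -> number of documents from that host.
    """
    kept = []
    shown = defaultdict(int)  # hits already kept per host
    for host, doc_id in ranked_hits:
        if shown[host] < host_cap(host_sizes[host], docs_per_hit, min_hits):
            kept.append((host, doc_id))
            shown[host] += 1
    return kept


def precision_at_k(ranked_doc_ids, relevant_ids, k=20):
    """Fraction of the first k results that are judged relevant (P@k)."""
    top = ranked_doc_ids[:k]
    return sum(1 for d in top if d in relevant_ids) / k
```

Under these assumptions, a host with 2,000 documents would contribute at most two hits and a host with 1,000 documents at most one, regardless of the query. Comparing `precision_at_k` on the full and truncated lists is one way to check whether a given cap affects early precision.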