Repeatable evaluation of search services in dynamic environments
ACM Transactions on Information Systems (TOIS)
In dynamic environments such as the World Wide Web, a changing document collection, query population, and set of search services demand frequent repetition of search effectiveness (relevance) evaluations. Reconstructing static test collections, as in TREC, requires considerable human effort, because large collection sizes demand judgments deep into retrieved pools. In practice, it is common to perform shallow evaluations over a small number of conditions (often a binary comparison, A vs. B) without system pooling, with no intention of constructing reusable test collections. The query sample sizes required in such evaluations can be reliably estimated with the simple bootstrap estimate of the reproducibility probability (observed power) of hypothesis tests, but they are typically much larger than those needed for static collections. We propose a semiautomatic evaluation framework that reduces this effort by enabling intelligent evaluation strategies. We validate this framework against a manual evaluation of the top ten results of ten web search engines across 896 queries in navigational and informational tasks. Augmenting manual judgments with pseudo-relevance judgments mined, even naively, from web taxonomies roughly halves both the chance of missing a correct binary conclusion and the chance of reaching an errant one.
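The bootstrap estimate of reproducibility probability mentioned in the abstract can be sketched as follows. This is a minimal illustration, not the authors' exact procedure: the function names, the use of a paired t-test, and the hard-coded critical value are assumptions. The idea is to resample per-query score differences between two systems with replacement and report the fraction of resamples in which the significance test still rejects at the original level.

```python
import math
import random

def t_stat(diffs):
    """One-sample t statistic of the per-query score differences."""
    n = len(diffs)
    mean = sum(diffs) / n
    var = sum((d - mean) ** 2 for d in diffs) / (n - 1)
    if var == 0:
        # Degenerate resample: all differences identical.
        return float("inf") if mean != 0 else 0.0
    return mean / math.sqrt(var / n)

def reproducibility_probability(diffs, t_crit=1.984, trials=2000, seed=0):
    """Bootstrap estimate of observed power: the fraction of resamples
    (drawn with replacement over queries) in which a two-sided paired
    t-test still rejects H0 at the same significance level.

    t_crit defaults to roughly the 0.05 two-sided critical value for
    ~100 queries; adjust it for other sample sizes or levels."""
    rng = random.Random(seed)
    n = len(diffs)
    rejects = 0
    for _ in range(trials):
        resample = [diffs[rng.randrange(n)] for _ in range(n)]
        if abs(t_stat(resample)) > t_crit:
            rejects += 1
    return rejects / trials

# Hypothetical usage: simulated per-query effectiveness differences
# with a small true advantage for system A over system B.
rng = random.Random(1)
diffs = [rng.gauss(0.03, 0.10) for _ in range(100)]
rp = reproducibility_probability(diffs)
```

A low `rp` suggests the observed A-vs-B conclusion would often fail to replicate on a fresh query sample of the same size, which is why such shallow, non-pooled evaluations tend to need many more queries than static test collections.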