Repeatable evaluation of search services in dynamic environments
ACM Transactions on Information Systems (TOIS)
We describe a framework of bootstrapped hypothesis testing for estimating the confidence that one web search engine outperforms another over any randomly sampled query set of a given size. To validate this framework, we constructed and made available a precision-oriented test collection consisting of manual binary relevance judgments for the top ten results from each of ten web search engines across 896 queries, as well as for the single best result for each of those queries. Results from this bootstrapping approach over typical query set sizes indicate that repeated statistical tests are imperative: a single test is quite likely to find significant differences that do not generalize. We also find that the number of queries needed for a repeatable evaluation in a dynamic environment such as the web is much larger than previously studied.
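The core idea of the framework lends itself to a short illustration. The sketch below resamples query sets of a fixed size with replacement and counts how often one engine's mean effectiveness exceeds the other's, assuming per-query scores (e.g., precision at ten) are already available for both engines. The function name, the simple mean comparison, and the sample sizes are illustrative assumptions, not the paper's exact protocol.

```python
import numpy as np

def bootstrap_outperformance(scores_a, scores_b, sample_size, n_bootstrap=10000, seed=0):
    """Estimate the confidence that engine A outperforms engine B on a
    randomly drawn query set of `sample_size` queries.

    scores_a, scores_b: per-query effectiveness scores for the same queries,
    in the same order. Returns the fraction of bootstrap samples on which
    A's mean score exceeds B's.
    """
    rng = np.random.default_rng(seed)
    a = np.asarray(scores_a, dtype=float)
    b = np.asarray(scores_b, dtype=float)
    n_queries = len(a)

    wins = 0
    for _ in range(n_bootstrap):
        # Draw a query set of the requested size, with replacement.
        idx = rng.integers(0, n_queries, size=sample_size)
        if a[idx].mean() > b[idx].mean():
            wins += 1
    return wins / n_bootstrap

# Hypothetical usage: per-query precision@10 for two engines over the same 896 queries.
# conf = bootstrap_outperformance(p10_engine_a, p10_engine_b, sample_size=50)
# print(f"Confidence that A outperforms B on a 50-query sample: {conf:.3f}")
```

Repeating the comparison over many resampled query sets, rather than relying on a single fixed set, is what exposes the instability the abstract warns about: an individual sample can easily show a "significant" difference that does not hold across other samples of the same size.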