Some perspectives on the evaluation of information retrieval systems
Journal of the American Society for Information Science - Special issue: evaluation of information retrieval systems
Statistical inference in retrieval effectiveness evaluation
Information Processing and Management: an International Journal
Efficient construction of large test collections
Proceedings of the 21st annual international ACM SIGIR conference on Research and development in information retrieval
How reliable are the results of large-scale information retrieval experiments?
Proceedings of the 21st annual international ACM SIGIR conference on Research and development in information retrieval
Variations in relevance judgments and the measurement of retrieval effectiveness
Proceedings of the 21st annual international ACM SIGIR conference on Research and development in information retrieval
Efficient crawling through URL ordering
WWW7 Proceedings of the seventh international conference on World Wide Web 7
Results and challenges in Web search evaluation
WWW '99 Proceedings of the eighth international conference on World Wide Web
Evaluating evaluation measure stability
SIGIR '00 Proceedings of the 23rd annual international ACM SIGIR conference on Research and development in information retrieval
Retrieval effectiveness on the web
Information Processing and Management: an International Journal
Ranking retrieval systems without relevance judgments
Proceedings of the 24th annual international ACM SIGIR conference on Research and development in information retrieval
Evaluating strategies for similarity search on the web
Proceedings of the 11th international conference on World Wide Web
Information Retrieval
The effect of topic set size on retrieval experiment error
SIGIR '02 Proceedings of the 25th annual international ACM SIGIR conference on Research and development in information retrieval
A critical examination of TDT's cost function
SIGIR '02 Proceedings of the 25th annual international ACM SIGIR conference on Research and development in information retrieval
Automatic evaluation of world wide web search services
SIGIR '02 Proceedings of the 25th annual international ACM SIGIR conference on Research and development in information retrieval
Precision Evaluation of Search Engines
World Wide Web
Using manually-built web directories for automatic evaluation of known-item retrieval
Proceedings of the 26th annual international ACM SIGIR conference on Research and development in information retrieval
The concept of relevance in IR
Journal of the American Society for Information Science and Technology
Methods for ranking information retrieval systems without relevance judgments
Proceedings of the 2003 ACM symposium on Applied computing
Using titles and category names from editor-driven taxonomies for automatic evaluation
CIKM '03 Proceedings of the twelfth international conference on Information and knowledge management
A unified model for metasearch, pooling, and system evaluation
CIKM '03 Proceedings of the twelfth international conference on Information and knowledge management
What's new on the web?: the evolution of the web from a search engine perspective
Proceedings of the 13th international conference on World Wide Web
Forming test collections with no system pooling
Proceedings of the 27th annual international ACM SIGIR conference on Research and development in information retrieval
Hourly analysis of a very large topically categorized web query log
Proceedings of the 27th annual international ACM SIGIR conference on Research and development in information retrieval
Evaluation of filtering current news search results
Proceedings of the 27th annual international ACM SIGIR conference on Research and development in information retrieval
A General Evaluation Framework for Topical Crawlers
Information Retrieval
A temporal comparison of AltaVista Web searching
Journal of the American Society for Information Science and Technology
A framework for determining necessary query set sizes to evaluate web search effectiveness
WWW '05 Special interest tracks and posters of the 14th international conference on World Wide Web
Accurately interpreting clickthrough data as implicit feedback
Proceedings of the 28th annual international ACM SIGIR conference on Research and development in information retrieval
Information retrieval system evaluation: effort, sensitivity, and reliability
Proceedings of the 28th annual international ACM SIGIR conference on Research and development in information retrieval
Revisiting the effect of topic set size on retrieval error
Proceedings of the 28th annual international ACM SIGIR conference on Research and development in information retrieval
Automatic ranking of information retrieval systems using data fusion
Information Processing and Management: an International Journal
How are we searching the world wide web?: a comparison of nine search engine transaction logs
Information Processing and Management: an International Journal - Special issue: Formal methods for information retrieval
InfoScale '06 Proceedings of the 1st international conference on Scalable information systems
Minimal test collections for retrieval evaluation
SIGIR '06 Proceedings of the 29th annual international ACM SIGIR conference on Research and development in information retrieval
Dynamic test collections: measuring search effectiveness on the live web
SIGIR '06 Proceedings of the 29th annual international ACM SIGIR conference on Research and development in information retrieval
Evaluating evaluation metrics based on the bootstrap
SIGIR '06 Proceedings of the 29th annual international ACM SIGIR conference on Research and development in information retrieval
Statistical precision of information retrieval evaluation
SIGIR '06 Proceedings of the 29th annual international ACM SIGIR conference on Research and development in information retrieval
A statistical method for system evaluation using incomplete judgments
SIGIR '06 Proceedings of the 29th annual international ACM SIGIR conference on Research and development in information retrieval
Temporal analysis of a very large topically categorized Web query log
Journal of the American Society for Information Science and Technology
Repeatable evaluation of information retrieval effectiveness in dynamic environments
Engineering of Software-Intensive Systems: State of the Art and Research Challenges
Software-Intensive Systems and New Computing Paradigms
Measuring the reusability of test collections
Proceedings of the third ACM international conference on Web search and data mining
In dynamic environments such as the World Wide Web, a changing document collection, query population, and set of search services demand frequent repetition of search effectiveness (relevance) evaluations. Reconstructing static test collections, as in TREC, requires considerable human effort, because large collection sizes demand relevance judgments deep into the retrieved pools. In practice it is common to perform shallow evaluations over a small number of live engines (often pairwise, engine A vs. engine B) without system pooling. Although these evaluations are not intended to construct reusable test collections, their utility depends on their conclusions generalizing to the query population as a whole. We leverage the bootstrap estimate of the reproducibility probability of hypothesis tests to determine the query sample sizes required to ensure this generalization, and find that they are much larger than those required for static collections. We therefore propose a semiautomatic evaluation framework to reduce this effort. We validate the framework against a manual evaluation of the top ten results of ten Web search engines across 896 queries on navigational and informational tasks. Augmenting manual judgments with pseudo-relevance judgments mined from Web taxonomies reduces both the chance of missing a correct pairwise conclusion and the chance of reaching an erroneous conclusion by approximately 50%.
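
The key statistical tool named in the abstract is the bootstrap estimate of the reproducibility probability of a significance test: resample the per-query effectiveness differences between two engines with replacement and measure how often a fresh sample of a given size would again yield a significant result in the same direction. The following Python sketch illustrates that idea under stated assumptions; the function names, the choice of a paired t-test (a one-sample test on per-query differences), and the toy data are illustrative, not the paper's actual implementation or metric.

import numpy as np
from scipy import stats

def bootstrap_reproducibility(diffs, sample_size, alpha=0.05, n_boot=1000, seed=None):
    # Estimate the probability that a fresh sample of `sample_size` queries would
    # again produce a significant difference in the same direction, by resampling
    # the observed per-query effectiveness differences with replacement.
    rng = np.random.default_rng(seed)
    diffs = np.asarray(diffs, dtype=float)
    direction = np.sign(diffs.mean())
    hits = 0
    for _ in range(n_boot):
        resample = rng.choice(diffs, size=sample_size, replace=True)
        stat, p = stats.ttest_1samp(resample, 0.0)  # paired comparison as a one-sample test on differences
        if p < alpha and np.sign(resample.mean()) == direction:
            hits += 1
    return hits / n_boot

def required_query_count(diffs, target=0.95, alpha=0.05, max_queries=5000, step=50):
    # Smallest query sample size whose estimated reproducibility meets the target.
    for m in range(step, max_queries + 1, step):
        if bootstrap_reproducibility(diffs, m, alpha=alpha, seed=m) >= target:
            return m
    return None  # target not reachable within max_queries

# Toy usage: hypothetical per-query precision@10 differences between engine A and engine B.
rng = np.random.default_rng(0)
toy_diffs = rng.normal(loc=0.02, scale=0.15, size=896)  # small, noisy advantage for A
print(required_query_count(toy_diffs))

Under these assumptions, the required sample size grows rapidly as the per-query advantage shrinks relative to its variance across queries, which is in line with the abstract's finding that pairwise evaluations of live engines need far more queries than static test collections.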