Evaluation of IR systems has always been difficult because of the need for manually assessed relevance judgments. The advent of large editor-driven taxonomies on the web opens the door to a new evaluation approach. We use the ODP (Open Directory Project) taxonomy to find sets of pseudo-relevant documents via one of two assumptions: 1) a taxonomy entry is relevant to a given query if its editor-entered title exactly matches the query, or 2) all entries in a leaf-level taxonomy category are relevant to a given query if the category title exactly matches the query. We compare and contrast these two methodologies by evaluating six web search engines on a sample from an America Online log of ten million web queries, using mean reciprocal rank (MRR) measures for the first method and precision-based measures for the second. We show that this technique is stable with respect to the query set selected and correlates with a reasonably large manual evaluation.
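To make the two assumptions concrete, the following is a minimal sketch (not the authors' implementation) of how pseudo-relevant URL sets could be derived from taxonomy data and used to score a single engine. The input shapes `odp_entries` and `run`, and all function names, are hypothetical; the sketch assumes exact, case-insensitive string matching between queries and editor-entered titles or leaf-category titles, MRR for the title-match assumption, and precision at k for the category-match assumption.

```python
from collections import defaultdict


def build_pseudo_qrels(odp_entries, queries):
    """Map each query to two pseudo-relevant URL sets derived from the taxonomy.

    odp_entries: iterable of (title, url, category_title, category_urls), where
    category_urls lists every URL in the entry's leaf-level category.
    Assumption 1: an entry is relevant if its editor-entered title exactly
    matches the query. Assumption 2: every entry in a leaf category is relevant
    if the category title exactly matches the query.
    """
    title_rel = defaultdict(set)     # query -> URLs relevant under assumption 1
    category_rel = defaultdict(set)  # query -> URLs relevant under assumption 2
    query_set = {q.lower() for q in queries}
    for title, url, cat_title, cat_urls in odp_entries:
        if title.lower() in query_set:
            title_rel[title.lower()].add(url)
        if cat_title.lower() in query_set:
            category_rel[cat_title.lower()].update(cat_urls)
    return title_rel, category_rel


def reciprocal_rank(results, relevant):
    """Return 1/rank of the first pseudo-relevant URL, or 0 if none appears."""
    for rank, url in enumerate(results, start=1):
        if url in relevant:
            return 1.0 / rank
    return 0.0


def precision_at_k(results, relevant, k=10):
    """Return the fraction of the top-k results that are pseudo-relevant."""
    top = results[:k]
    return sum(1 for url in top if url in relevant) / k if top else 0.0


def evaluate_engine(run, title_rel, category_rel, k=10):
    """Score one engine: MRR under assumption 1, mean P@k under assumption 2.

    run: dict mapping each query to the engine's ranked list of result URLs.
    """
    rr = [reciprocal_rank(urls, title_rel.get(q, set())) for q, urls in run.items()]
    pk = [precision_at_k(urls, category_rel.get(q, set()), k) for q, urls in run.items()]
    mrr = sum(rr) / len(rr) if rr else 0.0
    mean_pk = sum(pk) / len(pk) if pk else 0.0
    return mrr, mean_pk
```

Repeating `evaluate_engine` over each of the six engines' result lists would yield the per-engine MRR and precision scores; comparing rankings induced by different query samples is one way the stability claim above could be checked.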