Choices in batch information retrieval evaluation

Authors:
Falk Scholer;Alistair Moffat;Paul Thomas
Affiliations:
RMIT University, Australia;The University of Melbourne, Australia;CSIRO & ANU, Australia
Venue:
Proceedings of the 18th Australasian Document Computing Symposium
Year:
2013

Citing 33
Cited 0

Variations in relevance assessments and the measurement of retrieval effectiveness

Journal of the American Society for Information Science - Special issue: evaluation of information retrieval systems
How reliable are the results of large-scale information retrieval experiments?

Proceedings of the 21st annual international ACM SIGIR conference on Research and development in information retrieval
Do batch and user evaluations give the same results?

SIGIR '00 Proceedings of the 23rd annual international ACM SIGIR conference on Research and development in information retrieval
Why batch and user evaluations do not give the same results

Proceedings of the 24th annual international ACM SIGIR conference on Research and development in information retrieval
Information retrieval system evaluation: effort, sensitivity, and reliability

Proceedings of the 28th annual international ACM SIGIR conference on Research and development in information retrieval
When will information retrieval be "good enough"?

Proceedings of the 28th annual international ACM SIGIR conference on Research and development in information retrieval
TREC: Experiment and Evaluation in Information Retrieval (Digital Libraries and Electronic Publishing)

TREC: Experiment and Evaluation in Information Retrieval (Digital Libraries and Electronic Publishing)
User performance versus precision measures for simple search tasks

SIGIR '06 Proceedings of the 29th annual international ACM SIGIR conference on Research and development in information retrieval
Minimal test collections for retrieval evaluation

SIGIR '06 Proceedings of the 29th annual international ACM SIGIR conference on Research and development in information retrieval
Bias and the limits of pooling

SIGIR '06 Proceedings of the 29th annual international ACM SIGIR conference on Research and development in information retrieval
On GMAP: and other transformations

CIKM '06 Proceedings of the 15th ACM international conference on Information and knowledge management
Reliable information retrieval evaluation with incomplete and biased judgements

SIGIR '07 Proceedings of the 30th annual international ACM SIGIR conference on Research and development in information retrieval
Alternatives to Bpref

SIGIR '07 Proceedings of the 30th annual international ACM SIGIR conference on Research and development in information retrieval
Strategic system comparisons via targeted relevance judgments

SIGIR '07 Proceedings of the 30th annual international ACM SIGIR conference on Research and development in information retrieval
How well does result relevance predict session satisfaction?

SIGIR '07 Proceedings of the 30th annual international ACM SIGIR conference on Research and development in information retrieval
The relationship between IR effectiveness measures and user satisfaction

SIGIR '07 Proceedings of the 30th annual international ACM SIGIR conference on Research and development in information retrieval
Bias and the limits of pooling for large collections

Information Retrieval
User adaptation: good results from poor systems

Proceedings of the 31st annual international ACM SIGIR conference on Research and development in information retrieval
Learning query intent from regularized click graphs

Proceedings of the 31st annual international ACM SIGIR conference on Research and development in information retrieval
A simple and efficient sampling method for estimating AP and NDCG

Proceedings of the 31st annual international ACM SIGIR conference on Research and development in information retrieval
Relevance assessment: are judges exchangeable and does it matter

Proceedings of the 31st annual international ACM SIGIR conference on Research and development in information retrieval
Cost and benefit analysis of mediated enterprise search

Proceedings of the 9th ACM/IEEE-CS joint conference on Digital libraries
Including summaries in system evaluation

Proceedings of the 32nd international ACM SIGIR conference on Research and development in information retrieval
Beyond DCG: user behavior as a predictor of a successful search

Proceedings of the third ACM international conference on Web search and data mining
Information Retrieval: Implementing and Evaluating Search Engines

Information Retrieval: Implementing and Evaluating Search Engines
A comparative analysis of cascade measures for novelty and diversity

Proceedings of the fourth ACM international conference on Web search and data mining
In search of quality in crowdsourcing for search engine evaluation

ECIR'11 Proceedings of the 33rd European conference on Advances in information retrieval
Relative effect of spam and irrelevant documents on user interaction with search engines

Proceedings of the 20th ACM international conference on Information and knowledge management
A classification of IR effectiveness metrics

ECIR'06 Proceedings of the 28th European conference on Advances in Information Retrieval
On judgments obtained from a commercial search engine

SIGIR '12 Proceedings of the 35th international ACM SIGIR conference on Research and development in information retrieval
Grannies, tanning beds, tattoos and NASCAR: evaluation of search tasks with varying levels of cognitive complexity

Proceedings of the 4th Information Interaction in Context Symposium
Models and metrics: IR evaluation as a user process

Proceedings of the Seventeenth Australasian Document Computing Symposium
Users versus models: what observation tells us about effectiveness metrics

Proceedings of the 22nd ACM international conference on Conference on information & knowledge management

Quantified Score

Hi-index	0.00

Visualization

Abstract

Web search tools are used on a daily basis by billions of people. The commercial providers of these services spend large amounts of money measuring their own effectiveness and benchmarking against their competitors; nothing less than their corporate survival is at stake. Techniques for offline or "batch" evaluation of search quality have received considerable attention, spanning ways of constructing relevance judgments; ways of using them to generate numeric scores; and ways of inferring system "superiority" from sets of such scores. Our purpose in this paper is consider these mechanisms as a chain of inter-dependent activities, in order to explore some of the ramifications of alternative components. By disaggregating the different activities, and asking what the ultimate objective of the measurement process is, we provide new insights into evaluation approaches, and are able to suggest new combinations that might prove fruitful avenues for exploration. Our observations are examined with reference to data collected from a user study covering 34 users undertaking a total of six search tasks each, using two systems of markedly different quality. We hope to encourage broader awareness of the many factors that go into an evaluation of search effectiveness, and of the implications of these choices, and encourage researchers to carefully report all aspects of the evaluation process when describing their system performance experiments.