This paper presents results comparing user preferences for search engine rankings with effectiveness measures computed from a test collection. It establishes that preferences and evaluation measures correlate: systems measured as better on a test collection are preferred by users. This correlation holds both for "conventional web retrieval" and for retrieval that emphasizes diverse results. Of a selection of other well-known measures, nDCG is found to correlate best with user preferences. Unlike previous studies in this area, the examination involved a large population of users, recruited through crowdsourcing, who were exposed to a wide range of retrieval systems, test collections, and search tasks. Reasons for user preferences were also gathered and analyzed. The work revealed a number of new results, but also showed that there is considerable scope for future work on refining effectiveness measures to better capture user preferences.
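For readers unfamiliar with the measure the study found most predictive, the following is a minimal sketch of how nDCG at a rank cutoff is typically computed from graded relevance judgments. The exponential gain formulation, the cutoff, and the example grades are illustrative assumptions, not the exact configuration used in the study.

```python
import math

def dcg(relevances, k):
    """Discounted cumulative gain over the top-k graded relevance labels."""
    return sum((2 ** rel - 1) / math.log2(rank + 2)
               for rank, rel in enumerate(relevances[:k]))

def ndcg(ranked_relevances, judged_relevances, k=10):
    """nDCG@k: DCG of the system ranking normalised by the ideal DCG,
    where the ideal ranking sorts the judged grades in descending order."""
    ideal = dcg(sorted(judged_relevances, reverse=True), k)
    return dcg(ranked_relevances, k) / ideal if ideal > 0 else 0.0

# Hypothetical example: grades (0-3) of the documents a system returned,
# and the pool of judged grades used to form the ideal ranking.
system_run = [3, 2, 3, 0, 1, 2]
judgment_pool = [3, 3, 3, 2, 2, 1, 0, 0]
print(ndcg(system_run, judgment_pool, k=5))
```

In a preference-based comparison of the kind described above, a score such as this would be computed per query for each system, and the system with the higher score would be predicted to be the one users prefer.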