Including summaries in system evaluation

Authors:
Andrew Turpin;Falk Scholer;Kalvero Jarvelin;Mingfang Wu;J. Shane Culpepper
Affiliations:
RMIT University, Melbourne, Australia;RMIT University, Melbourne, Australia;University of Tampere, Tampere, Finland;RMIT University, Melbourne, Australia;RMIT University, Melbourne, Australia
Venue:
Proceedings of the 32nd international ACM SIGIR conference on Research and development in information retrieval
Year:
2009

Citing 24
Cited 21

Optimizing convenient online access to bibliographic databases

Information Services and Use
The Cranfield tests on index language devices

Readings in information retrieval
Advantages of query biased summaries in information retrieval

Proceedings of the 21st annual international ACM SIGIR conference on Research and development in information retrieval
Variations in relevance judgments and the measurement of retrieval effectiveness

Proceedings of the 21st annual international ACM SIGIR conference on Research and development in information retrieval
Do batch and user evaluations give the same results?

SIGIR '00 Proceedings of the 23rd annual international ACM SIGIR conference on Research and development in information retrieval
Evaluation by highly relevant documents

Proceedings of the 24th annual international ACM SIGIR conference on Research and development in information retrieval
Why batch and user evaluations do not give the same results

Proceedings of the 24th annual international ACM SIGIR conference on Research and development in information retrieval
Searcher performance in question answering

Proceedings of the 24th annual international ACM SIGIR conference on Research and development in information retrieval
Liberal relevance criteria of TREC -: counting on negligible documents?

SIGIR '02 Proceedings of the 25th annual international ACM SIGIR conference on Research and development in information retrieval
Cumulated gain-based evaluation of IR techniques

ACM Transactions on Information Systems (TOIS)
Query association for effective retrieval

Proceedings of the eleventh international conference on Information and knowledge management
Retrieval evaluation with incomplete information

Proceedings of the 27th annual international ACM SIGIR conference on Research and development in information retrieval
Forming test collections with no system pooling

Proceedings of the 27th annual international ACM SIGIR conference on Research and development in information retrieval
Accurately interpreting clickthrough data as implicit feedback

Proceedings of the 28th annual international ACM SIGIR conference on Research and development in information retrieval
When will information retrieval be "good enough"?

Proceedings of the 28th annual international ACM SIGIR conference on Research and development in information retrieval
The Turn: Integration of Information Seeking and Retrieval in Context (The Information Retrieval Series)

The Turn: Integration of Information Seeking and Retrieval in Context (The Information Retrieval Series)
Incremental test collections

Proceedings of the 14th ACM international conference on Information and knowledge management
TREC: Experiment and Evaluation in Information Retrieval (Digital Libraries and Electronic Publishing)

TREC: Experiment and Evaluation in Information Retrieval (Digital Libraries and Electronic Publishing)
User performance versus precision measures for simple search tasks

SIGIR '06 Proceedings of the 29th annual international ACM SIGIR conference on Research and development in information retrieval
Estimating average precision with incomplete and imperfect judgments

CIKM '06 Proceedings of the 15th ACM international conference on Information and knowledge management
Problems with Kendall's tau

SIGIR '07 Proceedings of the 30th annual international ACM SIGIR conference on Research and development in information retrieval
The good and the bad system: does the test collection predict users' effectiveness?

Proceedings of the 31st annual international ACM SIGIR conference on Research and development in information retrieval
Semantically driven snippet selection for supporting focused web searches

Data & Knowledge Engineering
Handbook of Parametric and Nonparametric Statistical Procedures

Handbook of Parametric and Nonparametric Statistical Procedures

Interactive relevance feedback with graded relevance and sentence extraction: simulated user experiments

Proceedings of the 18th ACM conference on Information and knowledge management
Evaluating whole-page relevance

Proceedings of the 33rd international ACM SIGIR conference on Research and development in information retrieval
Evaluating search systems using result page context

Proceedings of the third symposium on Information interaction in context
Presenting query aspects to support exploratory search

AUIC '10 Proceedings of the Eleventh Australasian Conference on User Interface - Volume 106
Expected browsing utility for web search evaluation

CIKM '10 Proceedings of the 19th ACM international conference on Information and knowledge management
Evaluating search engines by clickthrough data

ISWC'10 Proceedings of the 9th international semantic web conference on The semantic web - Volume Part II
Simulating simple and fallible relevance feedback

ECIR'11 Proceedings of the 33rd European conference on Advances in information retrieval
Search snippet evaluation at yandex: lessons learned and future directions

CLEF'11 Proceedings of the Second international conference on Multilingual and multimodal information access evaluation
Click the search button and be happy: evaluating direct and immediate information access

Proceedings of the 20th ACM international conference on Information and knowledge management
A noise-aware click model for web search

Proceedings of the fifth ACM international conference on Web search and data mining
Time-based calibration of effectiveness measures

SIGIR '12 Proceedings of the 35th international ACM SIGIR conference on Research and development in information retrieval
Summarizing highly structured documents for effective search interaction

SIGIR '12 Proceedings of the 35th international ACM SIGIR conference on Research and development in information retrieval
Model for simulating result document browsing in focused retrieval

Proceedings of the 4th Information Interaction in Context Symposium
Stochastic simulation of time-biased gain

Proceedings of the 21st ACM international conference on Information and knowledge management
Model Based Comparison of Discounted Cumulative Gain and Average Precision

Journal of Discrete Algorithms
Summaries, ranked retrieval and sessions: a unified framework for information access evaluation

Proceedings of the 36th international ACM SIGIR conference on Research and development in information retrieval
Click model-based information retrieval metrics

Proceedings of the 36th international ACM SIGIR conference on Research and development in information retrieval
Modeling behavioral factors ininteractive information retrieval

Proceedings of the 22nd ACM international conference on Conference on information & knowledge management
Choices in batch information retrieval evaluation

Proceedings of the 18th Australasian Document Computing Symposium
Exploiting user disagreement for web search evaluation: an experimental approach

Proceedings of the 7th ACM international conference on Web search and data mining
Report on the SIGIR 2013 workshop on modeling user behavior for information retrieval evaluation (MUBE 2013)

ACM SIGIR Forum

Quantified Score

Hi-index	0.00

Visualization

Abstract

In batch evaluation of retrieval systems, performance is calculated based on predetermined relevance judgements applied to a list of documents returned by the system for a query. This evaluation paradigm, however, ignores the current standard operation of search systems which require the user to view summaries of documents prior to reading the documents themselves. In this paper we modify the popular IR metrics MAP and P@10 to incorporate the summary reading step of the search process, and study the effects on system rankings using TREC data. Based on a user study, we establish likely disagreements between relevance judgements of summaries and of documents, and use these values to seed simulations of summary relevance in the TREC data. Re-evaluating the runs submitted to the TREC Web Track, we find the average correlation between system rankings and the original TREC rankings is 0.8 (Kendall τ), which is lower than commonly accepted for system orderings to be considered equivalent. The system that has the highest MAP in TREC generally remains amongst the highest MAP systems when summaries are taken into account, but other systems become equivalent to the top ranked system depending on the simulated summary relevance. Given that system orderings alter when summaries are taken into account, the small amount of effort required to judge summaries in addition to documents (19 seconds vs 88 seconds on average in our data) should be undertaken when constructing test collections.