Why batch and user evaluations do not give the same results

  • Authors:
  • Andrew H. Turpin; William Hersh

  • Affiliations:
  • Curtin Univ. of Technology, Perth, WA, Australia; Oregon Health Sciences Univ., Portland, OR

  • Venue:
  • Proceedings of the 24th annual international ACM SIGIR conference on Research and development in information retrieval
  • Year:
  • 2001


Abstract

Much system-oriented evaluation of information retrieval systems has used the Cranfield approach, in which queries are run against test collections in batch mode. Some researchers have questioned whether this approach applies to real-world searching, but little data exists for or against that assertion. We studied this question in the context of the TREC Interactive Track. Previous results demonstrated that improved performance, as measured by relevance-based metrics in batch studies, did not correspond with outcomes on real user searching tasks. The experiments in this paper analyzed those results to determine why this occurred. Our assessment showed that although the queries entered by real users into systems that performed better in batch studies yielded comparable gains in the ranking of relevant documents for those users, those gains did not translate into better performance on the specific tasks. This was most likely because users were able to adequately find and utilize relevant documents ranked further down the output list.