The existence and use of standard test collections in information retrieval experimentation allows results to be compared between research groups and over time. Such comparisons, however, are rarely made. Most researchers report results only from their own experiments, a practice that allows a lack of overall improvement to go unnoticed. In this paper, we analyze results achieved on the TREC Ad-Hoc, Web, Terabyte, and Robust collections as reported in SIGIR (1998--2008) and CIKM (2004--2008). Dozens of individual published experiments report effectiveness improvements, often with claims of statistical significance. However, there is little evidence of improvement in ad-hoc retrieval technology over the past decade. Baselines are generally weak, often falling below the median of the original TREC systems, and only a handful of experiments exceed the score of the best TREC automatic run. Given this finding, we question the value of achieving even a statistically significant result over a weak baseline. We propose that the community adopt a practice of regular longitudinal comparison to ensure measurable progress, or at least to prevent the lack of it from going unnoticed. We describe an online database of retrieval runs that facilitates such a practice.
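The significance tests the abstract alludes to are typically paired tests over per-topic effectiveness scores (e.g. average precision) for a proposed system versus its baseline. As a minimal illustration of the point, the sketch below implements a standard paired randomization (sign-flipping) test in plain Python; the function name and score lists are hypothetical, not taken from the paper.

```python
import random

def paired_randomization_test(scores_a, scores_b, trials=10000, seed=0):
    """Two-sided paired randomization test on per-topic effectiveness
    scores (e.g. per-topic average precision for two runs).

    Under the null hypothesis the systems are exchangeable, so the sign
    of each per-topic difference is flipped at random; the p-value is
    the fraction of permutations whose mean absolute difference is at
    least as large as the observed one.
    """
    rng = random.Random(seed)
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    observed = abs(sum(diffs) / len(diffs))
    hits = 0
    for _ in range(trials):
        # Randomly flip the sign of each per-topic difference.
        flipped = [d if rng.random() < 0.5 else -d for d in diffs]
        if abs(sum(flipped) / len(flipped)) >= observed:
            hits += 1
    return hits / trials
```

Note that such a test speaks only to the reliability of the difference between the two runs supplied, not to the strength of the baseline: a significant improvement over a run below the TREC median says nothing about whether the best original TREC automatic run has been exceeded, which is the paper's central caution.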