System evaluation has mattered since research on automatic language and information processing began. However, the (D)ARPA conferences have raised the stakes substantially, both by requiring and delivering systematic evaluations and by sustaining these through long-term programmes; it has been claimed that this has significantly raised task performance, as defined by appropriate effectiveness measures, and has promoted relevant engineering development. These controlled laboratory evaluations have, however, made very strong assumptions about the task context. The paper examines these assumptions for six task areas, considers their impact on evaluation and performance results, and argues that for current tasks of interest, e.g. summarising, it is now essential to play down the present narrowly defined performance measures in order to address the task context, and specifically the role of the human participant in the task, so that new measures of greater value can be developed and applied.
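The "narrowly defined performance measures" at issue are typified by set-based effectiveness scores such as precision, recall and the F-measure, computed against fixed gold-standard judgements. As a minimal illustrative sketch only (the paper prescribes no implementation, and the document IDs below are hypothetical placeholders), the following Python fragment shows how such measures are commonly computed for a single query:

```python
def effectiveness(retrieved, relevant):
    """Set-based precision, recall and F1 for one query.

    retrieved -- document IDs returned by the system
    relevant  -- document IDs judged relevant (gold standard)
    Both arguments are hypothetical; real evaluations such as
    TREC aggregate judgements over many queries and assessors.
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# Example: 3 of 5 retrieved documents are relevant,
# out of 4 relevant documents in total.
p, r, f = effectiveness({"d1", "d2", "d3", "d4", "d5"},
                        {"d1", "d3", "d5", "d9"})
print(f"P={p:.2f} R={r:.2f} F1={f:.2f}")  # P=0.60 R=0.75 F1=0.67
```

Measures of this kind deliberately abstract away from the surrounding task context, including the human participant, which is precisely the limitation the paper argues must now be addressed.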