The natural language community has made impressive progress in evaluation over the last four years. However, as the evaluations become more sophisticated and more ambitious, a fundamental problem emerges: how to compare results across changing evaluation paradigms. When the domain, the task, and the scoring procedures all change, as has been the case from MUCK-I to MUCK-II to MUC-3, comparability of results is lost, making it difficult to determine whether the field has progressed since the last evaluation.

Part of the success of the MUC conferences has been due to the incremental approach taken to system evaluation. Over the four-year period spanning the three conferences, the domain has become more "realistic", the task has become more ambitious and is specified in much greater detail, and the scoring procedures have evolved into a largely automated scoring mechanism. This incremental process has been critical to demonstrating the utility of the overall evaluation effort. Nevertheless, we still need some way to assess the overall progress of the field, and thus to compare the results and task difficulty of MUC-3 relative to MUCK-II.
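The MUC scoring procedures are built around recall and precision computed over template slot fills. A minimal sketch of that style of scoring, assuming a simplified counting scheme (the function name and convention of counting correct, possible, and actual slot fills are illustrative, not the official MUC scorer):

```python
# Illustrative MUC-style scoring over template slot fills (not the official scorer).
#   correct  = slots the system filled correctly
#   possible = slots present in the answer key
#   actual   = slots the system attempted to fill
def muc_scores(correct: int, possible: int, actual: int) -> tuple[float, float]:
    recall = correct / possible if possible else 0.0
    precision = correct / actual if actual else 0.0
    return recall, precision

# Example: 60 correct fills against a key of 100 slots, with 80 system responses.
recall, precision = muc_scores(60, 100, 80)
print(recall, precision)  # 0.6 0.75
```

Automating this kind of slot-level comparison is what makes large-scale evaluation across many systems and hundreds of documents tractable, though aligning system templates with key templates is the harder part of a real scorer.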