Toward a task-based gold standard for evaluation of NP chunks and technical terms

  • Authors:
  • Nina Wacholder; Peng Song

  • Affiliations:
  • Rutgers University; Rutgers University

  • Venue:
  • NAACL '03 Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology - Volume 1
  • Year:
  • 2003

Abstract

We propose a gold standard for evaluating two types of information extraction output -- noun phrase (NP) chunks (Abney 1991; Ramshaw and Marcus 1995) and technical terms (Justeson and Katz 1995; Daille 2000; Jacquemin 2002). The gold standard is built around the notion that since different semantic and syntactic variants of terms are arguably correct, a fully satisfactory assessment of the quality of the output must include task-based evaluation. We conducted an experiment that assessed subjects' choice of index terms in an information access task. Subjects showed a significant preference for index terms that are longer, as measured by number of words, and more complex, as measured by number of prepositions. These terms, which were identified by a human indexer, serve as the gold standard. The experimental protocol is a reliable and rigorous method for evaluating the quality of a set of terms. An important advantage of this task-based evaluation is that a set of index terms that differs from the gold standard can 'win' by providing better information access than the gold standard itself does. Although the individual human subject experiments are time consuming, the experimental interface, test materials, and data analysis programs are completely reusable.
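The two surface measures mentioned in the abstract -- term length in words and number of prepositions -- can be computed straightforwardly. The sketch below is only an illustration of those measures; the preposition list and function names are assumptions for exposition, not the authors' actual experimental protocol.

```python
# Minimal sketch: scoring candidate index terms by the two surface
# measures discussed in the abstract -- length in words and number of
# prepositions. The preposition list is an illustrative assumption.

PREPOSITIONS = {"of", "in", "for", "on", "with", "to", "by", "at", "from"}

def term_measures(term: str) -> tuple[int, int]:
    """Return (word_count, preposition_count) for a candidate index term."""
    words = term.lower().split()
    return len(words), sum(1 for w in words if w in PREPOSITIONS)

if __name__ == "__main__":
    candidates = [
        "noun phrase",
        "evaluation of noun phrase chunks",
        "task-based gold standard for technical terms",
    ]
    for t in candidates:
        n_words, n_preps = term_measures(t)
        print(f"{t!r}: {n_words} words, {n_preps} prepositions")
```

Longer, preposition-bearing terms score higher on both measures, which is the pattern the subjects in the experiment were reported to prefer.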