Will pyramids built of nuggets topple over?

  • Authors:
  • Jimmy Lin; Dina Demner-Fushman

  • Affiliations:
  • University of Maryland, College Park, MD; University of Maryland, College Park, MD

  • Venue:
  • HLT-NAACL '06: Proceedings of the Human Language Technology Conference of the North American Chapter of the Association for Computational Linguistics, Main Conference
  • Year:
  • 2006

Abstract

The present methodology for evaluating complex questions at TREC analyzes answers in terms of facts called "nuggets". The official F-score metric represents the harmonic mean of recall and precision at the nugget level. There is an implicit assumption that some facts are more important than others, which is implemented as a binary split between "vital" and "okay" nuggets. This distinction holds important implications for the TREC scoring model (essentially, systems only receive credit for retrieving vital nuggets) and is a source of evaluation instability. The upshot is that for many questions in the TREC test sets, the median score across all submitted runs is zero. In this work, we introduce a scoring model based on judgments from multiple assessors that captures a more refined notion of nugget importance. We demonstrate on TREC 2003, 2004, and 2005 data that our "nugget pyramids" address many shortcomings of the present methodology, while introducing only minimal additional overhead on the evaluation flow.
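The abstract outlines the mechanics of the metric: nugget recall is computed over the answer's matched nuggets, precision is approximated by a length allowance, and the two are combined in an F-score; the pyramid idea replaces the binary vital/okay split with weights derived from multiple assessors' judgments. The sketch below illustrates one plausible reading of that scoring model. The function names, the per-nugget allowance of 100 characters, the β value, and the normalization by assessor count are illustrative assumptions, not the paper's exact implementation.

```python
# A minimal sketch of pyramid-weighted nugget scoring, following the idea in
# the abstract: nugget weights come from multiple assessors' vital/okay
# judgments instead of a single binary split. All constants and names here
# are assumptions for illustration, not the authors' exact formulation.

def nugget_weights(vital_votes: dict[str, int], n_assessors: int) -> dict[str, float]:
    """Weight each nugget by the fraction of assessors who marked it vital.
    (The choice of normalizer cancels out in the recall ratio below.)"""
    return {nugget: votes / n_assessors for nugget, votes in vital_votes.items()}

def pyramid_f_score(matched: set[str], weights: dict[str, float],
                    answer_length: int, beta: float = 3.0,
                    allowance_per_nugget: int = 100) -> float:
    """Combine weighted nugget recall with a length-based precision proxy,
    in the style of the TREC nugget evaluation."""
    total = sum(weights.values())
    recall = sum(weights[n] for n in matched if n in weights) / total if total else 0.0
    # Precision proxy: answers get a character allowance per matched nugget
    # and are penalized only for length beyond it.
    allowance = allowance_per_nugget * len(matched)
    if answer_length <= allowance:
        precision = 1.0
    else:
        precision = 1.0 - (answer_length - allowance) / answer_length
    denom = beta**2 * precision + recall
    return (1 + beta**2) * precision * recall / denom if denom else 0.0

# Example: three assessors judged three nuggets; an answer of 250
# non-whitespace characters matched two of them.
votes = {"n1": 3, "n2": 1, "n3": 0}   # number of assessors voting "vital"
w = nugget_weights(votes, n_assessors=3)
print(pyramid_f_score({"n1", "n2"}, w, answer_length=250))
```

Under these assumptions, a nugget no assessor marked vital contributes nothing to recall but still earns the answer a length allowance, so the hard zero-credit cliff of the binary model becomes a graded penalty as assessor agreement varies.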