Will pyramids built of nuggets topple over?

  • Authors:
  • Jimmy Lin; Dina Demner-Fushman

  • Affiliations:
  • University of Maryland, College Park, MD; University of Maryland, College Park, MD

  • Venue:
  • HLT-NAACL '06: Proceedings of the Human Language Technology Conference of the North American Chapter of the Association for Computational Linguistics, Main Conference
  • Year:
  • 2006

Abstract

The present methodology for evaluating complex questions at TREC analyzes answers in terms of facts called "nuggets". The official F-score metric represents the harmonic mean of recall and precision at the nugget level. There is an implicit assumption that some facts are more important than others, which is implemented as a binary split between "vital" and "okay" nuggets. This distinction holds important implications for the TREC scoring model (essentially, systems only receive credit for retrieving vital nuggets) and is a source of evaluation instability. The upshot is that for many questions in the TREC test sets, the median score across all submitted runs is zero. In this work, we introduce a scoring model based on judgments from multiple assessors that captures a more refined notion of nugget importance. We demonstrate on TREC 2003, 2004, and 2005 data that our "nugget pyramids" address many shortcomings of the present methodology, while introducing only minimal additional overhead on the evaluation flow.
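The abstract outlines the mechanics of the metric: nugget recall is computed over the answer's matched nuggets, precision is approximated by a length allowance, and the two are combined in an F-score; the pyramid idea replaces the binary vital/okay split with weights derived from multiple assessors' judgments. The sketch below illustrates one plausible reading of that scoring model. The function names, the per-nugget allowance of 100 characters, the β value, and the normalization by assessor count are illustrative assumptions, not the paper's exact implementation.

```python
# A minimal sketch of pyramid-weighted nugget scoring, following the idea in
# the abstract: nugget weights come from multiple assessors' vital/okay
# judgments instead of a single binary split. All constants and names here
# are assumptions for illustration, not the authors' exact formulation.

def nugget_weights(vital_votes: dict[str, int], n_assessors: int) -> dict[str, float]:
    """Weight each nugget by the fraction of assessors who marked it vital.
    (The choice of normalizer cancels out in the recall ratio below.)"""
    return {nugget: votes / n_assessors for nugget, votes in vital_votes.items()}

def pyramid_f_score(matched: set[str], weights: dict[str, float],
                    answer_length: int, beta: float = 3.0,
                    allowance_per_nugget: int = 100) -> float:
    """Combine weighted nugget recall with a length-based precision proxy,
    in the style of the TREC nugget evaluation."""
    total = sum(weights.values())
    recall = sum(weights[n] for n in matched if n in weights) / total if total else 0.0
    # Precision proxy: answers get a character allowance per matched nugget
    # and are penalized only for length beyond it.
    allowance = allowance_per_nugget * len(matched)
    if answer_length <= allowance:
        precision = 1.0
    else:
        precision = 1.0 - (answer_length - allowance) / answer_length
    denom = beta**2 * precision + recall
    return (1 + beta**2) * precision * recall / denom if denom else 0.0

# Example: three assessors judged three nuggets; an answer of 250
# non-whitespace characters matched two of them.
votes = {"n1": 3, "n2": 1, "n3": 0}   # number of assessors voting "vital"
w = nugget_weights(votes, n_assessors=3)
print(pyramid_f_score({"n1", "n2"}, w, answer_length=250))
```

Under these assumptions, a nugget no assessor marked vital contributes nothing to recall but still earns the answer a length allowance, so the hard zero-credit cliff of the binary model becomes a graded penalty as assessor agreement varies.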