The current methodology for evaluating complex questions at TREC analyzes answers in terms of facts called "nuggets". The official F-score metric is the harmonic mean of recall and precision at the nugget level. The assumption that some facts are more important than others is implemented as a binary split between "vital" and "okay" nuggets. This distinction has important implications for the TREC scoring model: systems receive credit only for retrieving vital nuggets. It is also a source of evaluation instability; the upshot is that for many questions in the TREC test sets, the median score across all submitted runs is zero. In this work, we introduce a scoring model based on judgments from multiple assessors that captures a more refined notion of nugget importance. We demonstrate on TREC 2003, 2004, and 2005 data that our "nugget pyramids" address many shortcomings of the current methodology while introducing only minimal additional overhead in the evaluation workflow.
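The contrast between the two scoring models can be sketched in a few lines. This is an illustrative toy example, not the official TREC scorer: the nugget names, vote counts, and helper functions below are hypothetical, and the pyramid weight is taken simply as the number of assessors who marked a nugget vital.

```python
# Illustrative sketch (not the official TREC scorer) contrasting binary
# vital/okay nugget recall with a pyramid-weighted recall in which each
# nugget's weight is the number of assessors who judged it vital.

def binary_recall(matched, vital):
    """Recall under the binary model: only 'vital' nuggets count."""
    return len(matched & vital) / len(vital)

def pyramid_recall(matched, vital_votes):
    """Weighted recall: a nugget's weight is its count of 'vital' votes."""
    total = sum(vital_votes.values())
    return sum(vital_votes[n] for n in matched if n in vital_votes) / total

# Hypothetical question with four nuggets judged by three assessors.
vital = {"n1", "n2"}                                 # one assessor's vital set
vital_votes = {"n1": 3, "n2": 1, "n3": 2, "n4": 0}   # vital votes, 3 assessors
matched = {"n1", "n3"}                               # nuggets found in a run

print(binary_recall(matched, vital))         # 0.5: only n1 counts
print(pyramid_recall(matched, vital_votes))  # (3 + 2) / 6 = 0.833...
```

Note how the run is rewarded for retrieving "n3" (which two of three assessors considered vital) under the pyramid model but receives no credit for it under the binary split, which is the instability the abstract describes.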