Evaluating semantic evaluations: how RTE measures up

  • Authors:
  • Sam Bayer; John Burger; Lisa Ferro; John Henderson; Lynette Hirschman; Alex Yeh

  • Affiliations:
  • The MITRE Corporation, Bedford, MA (all authors)

  • Venue:
  • MLCW'05: Proceedings of the First International Conference on Machine Learning Challenges: Evaluating Predictive Uncertainty, Visual Object Classification, and Recognizing Textual Entailment
  • Year:
  • 2005


Abstract

In this paper, we discuss paradigms for evaluating open-domain semantic interpretation as they apply to the PASCAL Recognizing Textual Entailment (RTE) evaluation (Dagan et al. 2005). We focus on three aspects critical to a successful evaluation: creation of large quantities of reasonably good training data, analysis of inter-annotator agreement, and joint analysis of test item difficulty and test-taker proficiency (Rasch analysis). We found that although RTE does not correspond to a “real” or naturally occurring language processing task, it nonetheless provides clear and simple metrics, a tolerable cost of corpus development, good annotator reliability (with the potential to exploit the remaining variability), and the possibility of finding noisy but plentiful training material.
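As background for the Rasch analysis mentioned in the abstract, the following is the standard dichotomous Rasch model (a textbook formulation, not reproduced from the paper itself). It gives the probability that test-taker j with proficiency θ_j answers item i of difficulty b_i correctly:

$$ P(X_{ij} = 1 \mid \theta_j, b_i) = \frac{e^{\theta_j - b_i}}{1 + e^{\theta_j - b_i}} $$

Fitting this model estimates item difficulties and test-taker proficiencies jointly, which is the sense in which the abstract speaks of a "joint analysis of test item difficulty and test-taker proficiency."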