Evaluating text categorization
HLT '91 Proceedings of the workshop on Speech and Natural Language
Towards developing general models of usability with PARADISE
Natural Language Engineering
HLT-NAACL '06 Proceedings of the main conference on Human Language Technology Conference of the North American Chapter of the Association for Computational Linguistics
Predicting the quality and usability of spoken dialogue services
Speech Communication
Technical support dialog systems: issues, problems, and solutions
NAACL-HLT-Dialog '07 Proceedings of the Workshop on Bridging the Gap: Academic and Industrial Research in Dialog Technologies
Comparing Linguistic Features for Modeling Learning in Computer Tutoring
Proceedings of the 2007 conference on Artificial Intelligence in Education: Building Technology Rich Learning Contexts That Work
Exploiting discourse structure for spoken dialogue performance analysis
EMNLP '06 Proceedings of the 2006 Conference on Empirical Methods in Natural Language Processing
Understanding complex natural language explanations in tutorial applications
ScaNaLU '06 Proceedings of the Third Workshop on Scalable Natural Language Understanding
Using Natural Language Processing to Analyze Tutorial Dialogue Corpora Across Domains and Modalities
Proceedings of the 2009 conference on Artificial Intelligence in Education: Building Learning Systems that Care: From Knowledge Representation to Affective Modelling
The “DeMAND” coding scheme: A “common language” for representing and analyzing student discourse
Proceedings of the 2009 conference on Artificial Intelligence in Education: Building Learning Systems that Care: From Knowledge Representation to Affective Modelling
Dealing with interpretation errors in tutorial dialogue
SIGDIAL '09 Proceedings of the SIGDIAL 2009 Conference: The 10th Annual Meeting of the Special Interest Group on Discourse and Dialogue
"Ask not what textual entailment can do for you..."
ACL '10 Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics
BEETLE II: a system for tutoring and computational linguistics experimentation
ACLDemos '10 Proceedings of the ACL 2010 System Demonstrations
SemEval-2010 task 12: Parser evaluation using textual entailments
SemEval '10 Proceedings of the 5th International Workshop on Semantic Evaluation
Intelligent tutoring with natural language support in the BEETLE II system
EC-TEL '10 Proceedings of the 5th European Conference on Technology Enhanced Learning: Sustaining TEL: from innovation to learning and practice
The AT&T spoken language understanding system
IEEE Transactions on Audio, Speech, and Language Processing
Towards effective tutorial feedback for explanation questions: a dataset and baselines
NAACL HLT '12 Proceedings of the 2012 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies
It is not always clear how the differences in intrinsic evaluation metrics for a parser or classifier will affect the performance of the system that uses it. We investigate the relationship between the intrinsic evaluation scores of an interpretation component in a tutorial dialogue system and the learning outcomes in an experiment with human users. Following the PARADISE methodology, we use multiple linear regression to build predictive models of learning gain, an important objective outcome metric in tutorial dialogue. We show that standard intrinsic metrics such as F-score alone do not predict the outcomes well. However, we can build predictive performance functions that account for up to 50% of the variance in learning gain by combining features based on standard evaluation scores and on the confusion matrix entries. We argue that building such predictive models can help us better evaluate performance of NLP components that cannot be distinguished based on F-score alone, and illustrate our approach by comparing the current interpretation component in the system to a new classifier trained on the evaluation data.
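The PARADISE-style analysis the abstract describes can be sketched in a few lines: fit a multiple linear regression from intrinsic evaluation features (F-score plus confusion-matrix-derived rates) to a learning-gain outcome, and report the proportion of variance explained (R²). This is a hypothetical illustration, not the authors' code; the feature names, the simulated data, and the coefficient values are all assumptions made for the example.

```python
# Minimal sketch of a PARADISE-style predictive performance function:
# regress learning gain on intrinsic evaluation features and report R^2.
# All data here is simulated; in the paper, features come from real
# per-session evaluation scores and confusion-matrix entries.
import numpy as np

rng = np.random.default_rng(0)
n = 40  # hypothetical number of tutoring sessions

# Hypothetical per-session features: interpreter F-score plus two
# confusion-matrix-derived error rates (e.g. false-accept / false-reject).
f_score = rng.uniform(0.6, 0.9, n)
false_accept = rng.uniform(0.0, 0.2, n)
false_reject = rng.uniform(0.0, 0.2, n)

# Simulated learning gain: a linear function of the features plus noise.
gain = (0.5 * f_score - 0.8 * false_accept - 0.3 * false_reject
        + rng.normal(0.0, 0.02, n))

# Multiple linear regression via ordinary least squares (with intercept).
X = np.column_stack([np.ones(n), f_score, false_accept, false_reject])
beta, *_ = np.linalg.lstsq(X, gain, rcond=None)
pred = X @ beta

# R^2: fraction of variance in learning gain accounted for by the model,
# the same quantity behind the paper's "up to 50% of the variance" claim.
ss_res = np.sum((gain - pred) ** 2)
ss_tot = np.sum((gain - gain.mean()) ** 2)
r2 = 1.0 - ss_res / ss_tot
print(f"R^2 = {r2:.2f}")
```

With features that genuinely drive the outcome, R² is high; with F-score alone as the sole predictor it drops, which mirrors the paper's finding that standard intrinsic metrics by themselves predict learning gain poorly.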