Predicting automatic speech recognition performance using prosodic cues

  • Authors:
  • Diane J. Litman (AT&T Labs - Research, Florham Park, NJ)
  • Julia B. Hirschberg (AT&T Labs - Research, Florham Park, NJ)
  • Marc Swerts (Center for User-System Interaction, Eindhoven, The Netherlands)

  • Venue:
  • NAACL 2000: Proceedings of the 1st North American Chapter of the Association for Computational Linguistics Conference
  • Year:
  • 2000

Abstract

In spoken dialogue systems, it is important for a system to know how likely a speech recognition hypothesis is to be correct, so that it can reprompt for fresh input or, in cases where many errors have occurred, change its interaction strategy or switch the caller to a human attendant. We have discovered prosodic features that predict more accurately when a recognition hypothesis contains a word error than the acoustic confidence score thresholds traditionally used in automatic speech recognition. We present analytic results indicating that there are significant prosodic differences between correctly and incorrectly recognized turns in the TOOT train information corpus. We then present machine learning results showing how using prosodic features to automatically predict correctly versus incorrectly recognized turns improves over using acoustic confidence scores alone.
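
The sketch below illustrates the comparison the abstract describes: flagging misrecognized turns with a fixed acoustic-confidence threshold versus training a classifier on per-turn prosodic features plus the confidence score. It is a minimal, hypothetical example, not the authors' code: the feature names, the placeholder random data, and the scikit-learn decision tree are assumptions (the paper's own experiments used a rule-learning system on features extracted from the TOOT corpus).

```
# Minimal sketch (not the authors' implementation): contrast a confidence-threshold
# baseline with a classifier trained on prosodic features for predicting whether a
# recognized turn contains a word error. Feature names, data, and the decision-tree
# learner are placeholders for illustration only.
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n_turns = 500

# Placeholder per-turn features (illustrative values, not real measurements):
# e.g. F0 maximum/mean, RMS energy maximum, turn duration, preceding pause,
# speaking rate, plus the recognizer's acoustic confidence score.
prosody = rng.normal(size=(n_turns, 6))
confidence = rng.uniform(-6.0, 0.0, size=(n_turns, 1))   # log-likelihood-style score
misrecognized = (rng.uniform(size=n_turns) < 0.3).astype(int)  # 1 = turn has a word error

# Baseline: call a turn misrecognized when its confidence score falls below a threshold.
threshold = -3.0
baseline_pred = (confidence.ravel() < threshold).astype(int)
baseline_acc = (baseline_pred == misrecognized).mean()

# Learned predictor over prosodic features plus confidence, standing in for the
# rule learner used in the paper.
X = np.hstack([prosody, confidence])
clf = DecisionTreeClassifier(max_depth=4, random_state=0)
cv_acc = cross_val_score(clf, X, misrecognized, cv=5).mean()

# With random placeholder data these numbers are meaningless; the point is the
# pipeline shape: threshold baseline vs. feature-based classifier.
print(f"confidence-threshold baseline accuracy: {baseline_acc:.3f}")
print(f"prosody+confidence classifier accuracy (5-fold CV): {cv_acc:.3f}")
```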