Towards using prosody in speech recognition/understanding systems: differences between read and spontaneous speech

  • Authors:
  • Kim E. A. Silverman, Eleonora Blaauw, Judith Spitz, John F. Pitrelli

  • Affiliations:
  • NYNEX Science and Technology, White Plains, NY (all authors)

  • Venue:
  • HLT '91 Proceedings of the workshop on Speech and Natural Language
  • Year:
  • 1992

Abstract

A persistent problem for keyword-driven speech recognition systems is that users often embed the to-be-recognized words or phrases in longer utterances. The recognizer needs to locate the relevant sections of the speech signal and ignore extraneous words. Prosody might provide an extra source of information to help locate target words embedded in other speech. In this paper we examine some prosodic characteristics of 160 such utterances and compare matched read and spontaneous versions. Half of the utterances are from a corpus of spontaneous answers to requests for the name of a city, recorded from calls to Directory Assistance Operators. The other half are the same word strings read by volunteers attempting to model the real dialogue. Results show a consistent pattern across both sets of data: embedded city names almost always bear nuclear pitch accents and occur in their own intonational phrases. However, the distributions of the tonal make-up of these prosodic features differ markedly in read versus spontaneous speech, implying that if algorithms that exploit these prosodic regularities are trained on read speech, the resulting probabilities are likely to be poor models of real user speech.
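The training-mismatch concern in the final sentence can be made concrete with a standard divergence measure. The sketch below compares two hypothetical distributions over pitch-accent types; the category labels and numbers are purely illustrative, not measurements from this paper.

```python
import math

def kl_divergence(p, q):
    """Kullback-Leibler divergence D(p || q) in bits between two
    discrete distributions given as equal-length probability lists."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Hypothetical probabilities of three pitch-accent types (e.g. H*, L+H*, L*)
# on embedded city names -- illustrative values only.
read_speech = [0.70, 0.20, 0.10]
spontaneous_speech = [0.40, 0.35, 0.25]

# A recognizer trained on read speech uses read_speech as its model (q)
# while real users produce spontaneous_speech (p); a positive divergence
# quantifies how badly the read-speech probabilities fit real usage.
mismatch = kl_divergence(spontaneous_speech, read_speech)
print(f"model mismatch: {mismatch:.3f} bits")
```

Under these made-up numbers the divergence is well above zero, whereas it would be exactly zero if the read and spontaneous distributions matched, which is the scenario the abstract argues does not hold.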