Using prosody to improve automatic speech recognition

Authors:
Klára Vicsi;György Szaszák
Affiliations:
Laboratory of Speech Acoustics, Budapest University of Technology and Economics, TMIT, Stoczek u. 2, 1111 Budapest, Hungary;Laboratory of Speech Acoustics, Budapest University of Technology and Economics, TMIT, Stoczek u. 2, 1111 Budapest, Hungary
Venue:
Speech Communication
Year:
2010

Abstract

This paper addresses the acoustic processing and modelling of the supra-segmental characteristics of speech, with the aim of incorporating advanced syntactic and semantic level processing of spoken language into speech recognition/understanding tasks. The proposed modelling approach is very similar to the one used in standard speech recognition, where basic HMM units (most often acoustic phoneme models) are trained and then connected according to the dictionary and some grammar (language model) to obtain a recognition network, along which recognition can also be interpreted as an alignment process. Here the HMM framework is used to model speech prosody and to perform initial syntactic and/or semantic level processing of the input speech in parallel with standard speech recognition. Fundamental frequency and energy are used as acoustic-prosodic features.

A method was implemented for extracting syntactic level information from speech. The method was designed for fixed-stress languages, and it segments the input speech into syntactically linked word groups, or even single words corresponding to a syntactic unit (these word groups, which can consist of one or more words, are sometimes referred to as phonological phrases in psycholinguistics). These so-called word-stress units are marked by prosody and have an associated fundamental frequency and/or energy contour, which allows their detection. For this purpose, HMMs for the different types of word-stress unit contours were trained and then used for recognition and alignment of such units in the input speech. This prosodic segmentation of the input speech also allows word-boundary recovery and can be used for N-best lattice rescoring based on prosodic information. The syntactic level segmentation algorithm was evaluated for Hungarian and Finnish, both of which have fixed stress on the first syllable (that is, if a word is stressed, the stress is realized on its first syllable). The N-best rescoring based on syntactic level word-stress unit alignment was shown to increase the number of correctly recognized words.

For further syntactic and semantic level processing of the input speech in ASR, clause and sentence boundary detection and modality (sentence type) recognition were implemented. Again, classification was carried out by HMMs, which model the prosodic contour for each clause and/or sentence modality type. Clause (and hence also sentence) boundary detection relied on the capacity of HMMs to dynamically align the reference prosodic structure to the utterance coming from the ASR input. This method also allows punctuation to be marked automatically. This semantic level processing of speech was investigated for Hungarian and German. Modality types were recognized correctly in 69% of cases for Hungarian and 78% for German.
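
The idea of rescoring N-best hypotheses with prosodic segmentation can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not give the scoring formula, so the `Hypothesis` type, the `boundary_match_ratio` and `rescore` functions, the additive bonus, and the `weight` and `tolerance` parameters are all assumptions made for illustration. The sketch relies only on the property stated above: in a language with fixed stress on the first syllable, the start of a detected word-stress unit should coincide with the start of a word, so hypotheses whose word boundaries agree with the prosodic segmentation are favoured.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class Hypothesis:
    """One entry of an N-best list (hypothetical structure)."""
    words: List[str]
    word_starts: List[float]  # start time of each word, in seconds
    asr_score: float          # combined acoustic + language model log score


def boundary_match_ratio(word_starts: Sequence[float],
                         unit_starts: Sequence[float],
                         tolerance: float = 0.05) -> float:
    """Fraction of word-stress unit starts lying within `tolerance`
    seconds of some word start in the hypothesis."""
    if not unit_starts:
        return 0.0
    matched = sum(
        1 for u in unit_starts
        if any(abs(u - w) <= tolerance for w in word_starts)
    )
    return matched / len(unit_starts)


def rescore(hypotheses: Sequence[Hypothesis],
            unit_starts: Sequence[float],
            weight: float = 2.0,
            tolerance: float = 0.05) -> List[Hypothesis]:
    """Return the hypotheses sorted by ASR score plus a prosodic bonus.

    The additive combination and the default weight are assumptions,
    not values taken from the paper.
    """
    scored = [
        (h.asr_score
         + weight * boundary_match_ratio(h.word_starts, unit_starts, tolerance),
         h)
        for h in hypotheses
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [h for _, h in scored]


# Toy example: the second hypothesis has the lower ASR score, but its word
# boundaries agree with all three detected word-stress unit starts.
hyps = [
    Hypothesis(["a", "b"], [0.0, 0.60], -10.0),
    Hypothesis(["c", "d", "e"], [0.0, 0.42, 0.90], -10.5),
]
unit_starts = [0.0, 0.40, 0.91]
best = rescore(hyps, unit_starts)[0]
```

In this toy case the second hypothesis is ranked first after rescoring. In practice the weight would have to be tuned on held-out data, and the unit starts would come from aligning the word-stress unit HMMs to the fundamental frequency and energy contours.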