COST'11 Proceedings of the 2011 international conference on Cognitive Behavioural Systems
This paper addresses acoustic processing and modelling of the supra-segmental characteristics of speech, with the aim of incorporating advanced syntactic and semantic level processing of spoken language into speech recognition/understanding tasks. The proposed modelling approach is very similar to the one used in standard speech recognition, where basic HMM units (most often acoustic phoneme models) are trained and then connected according to the dictionary and a grammar (language model) to obtain a recognition network, along which recognition can also be interpreted as an alignment process. Here the HMM framework is used to model speech prosody and to perform initial syntactic and/or semantic level processing of the input speech in parallel with standard speech recognition. Fundamental frequency and energy are used as acoustic-prosodic features. A method was implemented for extracting syntactic level information from speech. The method was designed for fixed-stress languages, and it yields a segmentation of the input speech into syntactically linked word groups, or even single words corresponding to a syntactic unit (these word groups, which can consist of one or more words, are sometimes referred to as phonological phrases in psycholinguistics). These so-called word-stress units are marked by prosody and have an associated fundamental frequency and/or energy contour that allows them to be detected. To this end, HMMs for the different types of word-stress unit contours were trained and then used to recognize and align such units in the input speech. This prosodic segmentation also allows word-boundary recovery and can be used for N-best lattice rescoring based on prosodic information. The syntactic level segmentation algorithm was evaluated for Hungarian and Finnish, both of which have fixed stress on the first syllable (that is, if a word is stressed, the stress is realized on its first syllable). The N-best rescoring based on syntactic level word-stress unit alignment was shown to increase the number of correctly recognized words.
For further syntactic and semantic level processing of the input speech in ASR, clause and sentence boundary detection and modality (sentence type) recognition were implemented. Again, classification was carried out by HMMs, which model the prosodic contour of each clause and/or sentence modality type. Clause (and hence also sentence) boundary detection relied on the HMMs' capacity to dynamically align the reference prosodic structure to the utterance at the ASR input. This method also allows punctuation to be marked automatically. This semantic level processing was investigated for Hungarian and German. The correctness of modality recognition was 69% for Hungarian and 78% for German.
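The idea of rescoring an N-best list with prosodic segmentation can be illustrated with a minimal sketch. It is not the authors' implementation: the `Hypothesis` record, the `prosodic_score` agreement measure, the `weight` and `tolerance` parameters, and all timings below are assumptions made for illustration. The sketch relies only on the property stated above, that in a first-syllable fixed-stress language the start of a word-stress unit should coincide with the start of a word, and it rewards hypotheses whose word starts agree with the detected unit starts.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class Hypothesis:
    """One N-best entry (hypothetical structure, not from the paper)."""
    words: list[str]
    word_starts: list[float]  # start time of each word, in seconds
    asr_score: float          # recognizer log-score, higher is better


def prosodic_score(word_starts, unit_starts, tolerance=0.05):
    """Fraction of word-stress unit starts lying within `tolerance`
    seconds of some hypothesized word start."""
    if not unit_starts:
        return 0.0
    matched = sum(
        1
        for u in unit_starts
        if any(abs(u - w) <= tolerance for w in word_starts)
    )
    return matched / len(unit_starts)


def rescore(nbest, unit_starts, weight=1.0, tolerance=0.05):
    """Sort hypotheses by ASR score plus weighted prosodic agreement,
    best first. The additive combination is an assumption."""
    def combined(h):
        agreement = prosodic_score(h.word_starts, unit_starts, tolerance)
        return h.asr_score + weight * agreement

    return sorted(nbest, key=combined, reverse=True)


if __name__ == "__main__":
    # Unit starts as they might come from the prosodic HMM alignment
    # (made-up values).
    unit_starts = [0.00, 0.62, 1.30]
    nbest = [
        Hypothesis(["a", "b"], [0.00, 0.95], -10.0),
        Hypothesis(["c", "d", "e"], [0.00, 0.60, 1.32], -10.4),
    ]
    best = rescore(nbest, unit_starts)[0]
    print(best.words)
```

In this toy input the second hypothesis has the lower recognizer score, but its word starts agree with all three detected unit starts, so it moves to the top of the list after rescoring. In practice the weight would have to be tuned on held-out data.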