On the impact of children's emotional speech on acoustic and language models

  • Authors:
  • Stefan Steidl (Lehrstuhl für Mustererkennung, Friedrich-Alexander-Universität Erlangen-Nürnberg, Erlangen, Germany)
  • Anton Batliner (Lehrstuhl für Mustererkennung, Friedrich-Alexander-Universität Erlangen-Nürnberg, Erlangen, Germany)
  • Dino Seppi (ESAT, Katholieke Universiteit Leuven, Heverlee, Leuven, Belgium)
  • Björn Schuller (Institute for Human-Machine Communication, Technische Universität München, München, Germany)

  • Venue:
  • EURASIP Journal on Audio, Speech, and Music Processing - Special issue on atypical speech
  • Year:
  • 2010


Abstract

The automatic recognition of children's speech is well known to be a challenge, and so is the influence of affect, which is believed to degrade the performance of a speech recogniser. In this contribution, we investigate the combination of both phenomena. Extensive test runs are carried out for 1k-vocabulary continuous speech recognition on spontaneous motherese, emphatic, and angry children's speech as opposed to neutral speech. The experiments address the question of how specific emotions influence word accuracy. In a first scenario, "emotional" speech recognisers are compared to a speech recogniser trained on neutral speech only. For this comparison, equal amounts of training data are used for each emotion-related state. In a second scenario, a "neutral" speech recogniser trained on large amounts of neutral speech is adapted by adding a small amount of emotionally coloured data to the training process. The results show that emphatic and angry speech is recognised best, even better than neutral speech, and that performance can be improved further by adapting the acoustic and linguistic models. To illustrate the variability of emotional speech, we visualise the distribution of the four emotion-related states in the MFCC space by applying a Sammon transformation.
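As a rough illustration of this visualisation step, the sketch below computes one mean MFCC vector per utterance and projects the set to two dimensions with a plain gradient-descent Sammon mapping. This is a minimal sketch, not the authors' pipeline: the file names, emotion labels, feature settings, learning rate, and iteration count are illustrative assumptions.

```python
# Minimal sketch (assumptions, not the paper's configuration): project
# utterance-level mean MFCC vectors to 2-D with a Sammon mapping.
import numpy as np
import librosa
import matplotlib.pyplot as plt

def sammon_2d(X, n_iter=500, lr=0.3, eps=1e-9, seed=0):
    """Gradient-descent Sammon mapping of X (n x d) down to 2-D."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    # Pairwise Euclidean distances in the original MFCC space.
    D = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(-1))
    D = np.maximum(D, eps)
    c = D[np.triu_indices(n, 1)].sum()        # normalising constant
    Y = 1e-2 * rng.standard_normal((n, 2))    # random 2-D start
    for _ in range(n_iter):
        d = np.sqrt(((Y[:, None, :] - Y[None, :, :]) ** 2).sum(-1))
        d = np.maximum(d, eps)
        ratio = (D - d) / (D * d)
        np.fill_diagonal(ratio, 0.0)
        # Gradient of the Sammon stress E = (1/c) * sum_{i<j} (D-d)^2 / D.
        grad = (-2.0 / c) * (ratio[:, :, None]
                             * (Y[:, None, :] - Y[None, :, :])).sum(axis=1)
        Y -= lr * grad
    return Y

# Hypothetical utterance list: (wav_path, emotion_label) pairs.
utterances = [("utt001.wav", "motherese"), ("utt002.wav", "neutral")]  # ...

feats, labels = [], []
for path, label in utterances:
    y, sr = librosa.load(path, sr=16000)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)  # coeffs x frames
    feats.append(mfcc.mean(axis=1))       # one mean vector per utterance
    labels.append(label)

Y = sammon_2d(np.vstack(feats))
for state in sorted(set(labels)):
    idx = [i for i, l in enumerate(labels) if l == state]
    plt.scatter(Y[idx, 0], Y[idx, 1], label=state, s=10)
plt.legend()
plt.title("Sammon map of mean MFCC vectors")
plt.show()
```

Unlike PCA, the Sammon stress weights each pair by the inverse of its original distance, so small inter-cluster distances in the MFCC space are preserved preferentially, which is why it suits visualising overlap between emotion-related states.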