The Utsunomiya University (UU) Spoken Dialogue Database for Paralinguistic Information Studies is introduced. The UU Database is intended especially for studying the usage, structure, and effects of paralinguistic information in expressive Japanese conversational speech. Paralinguistic information refers to meaningful information, such as emotion or attitude, delivered along with the linguistic message. The UU Database provides labels of perceived emotional states for all utterances. The emotional states were annotated along six abstract dimensions: pleasant-unpleasant, aroused-sleepy, dominant-submissive, credible-doubtful, interested-indifferent, and positive-negative. To elicit expressively rich and vivid conversation, a "four-frame cartoon sorting task" was devised: four cards, each containing one frame extracted from a cartoon, are shuffled, and each participant, holding two of the four cards, must estimate the original order. The effectiveness of the method was supported by a broad distribution of subjective emotional-state ratings. Preliminary annotation experiments with a large number of annotators confirmed that most annotators could provide fairly consistent ratings for a repeated identical stimulus, and inter-rater agreement was good (W ≈ 0.5) for three of the six dimensions. Based on these results, three annotators were selected to label all 4840 utterances, and their high degree of agreement was verified using measures such as Kendall's coefficient of concordance W. Correlation analyses showed that not only prosodic parameters such as intensity and F0 but also a voice-quality parameter were related to the dimensions. Multiple correlation coefficients above 0.7 and RMS errors of about 0.6 were obtained when recognizing some dimensions from linear combinations of the speech parameters. Overall, the perceived emotional states of speakers could be estimated from the speech parameters with reasonable accuracy in most cases.
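The inter-rater agreement figure (W ≈ 0.5) refers to Kendall's coefficient of concordance. The paper's exact computation is not reproduced here, but the standard statistic can be sketched as follows; the data are synthetic and, for simplicity, tied ratings are not handled:

```python
import numpy as np

def kendalls_w(ratings):
    """Kendall's coefficient of concordance W for m raters over n items.

    ratings: (m_raters, n_items) array of scores; each rater's scores
    are converted to ranks (ties ignored in this sketch).
    W = 1 means perfect agreement among raters; W = 0 means none.
    """
    ratings = np.asarray(ratings, dtype=float)
    m, n = ratings.shape
    # convert each rater's scores to ranks 1..n
    ranks = np.argsort(np.argsort(ratings, axis=1), axis=1) + 1
    rank_sums = ranks.sum(axis=0)  # total rank received by each item
    s = ((rank_sums - rank_sums.mean()) ** 2).sum()
    return 12.0 * s / (m ** 2 * (n ** 3 - n))

# three hypothetical raters in perfect agreement -> W = 1
identical = np.tile(np.arange(10), (3, 1))
print(round(kendalls_w(identical), 3))  # -> 1.0
```

With independent random ratings W falls toward 0, so a value near 0.5 across many utterances, as reported for three of the six dimensions, indicates substantial but imperfect concordance.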
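The recognition result fits each emotional dimension with a linear combination of speech parameters. The actual features and corpus data are not available here, so the sketch below uses synthetic stand-ins for intensity, F0, and a voice-quality measure, and reports the multiple correlation R and RMS error analogous to the figures quoted in the abstract:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
# synthetic stand-ins for per-utterance speech parameters:
# column 0 ~ intensity, column 1 ~ mean F0, column 2 ~ a voice-quality measure
X = rng.normal(size=(n, 3))
# synthetic perceived-emotion ratings for one dimension (hypothetical weights)
y = X @ np.array([0.8, 0.5, 0.3]) + rng.normal(scale=0.5, size=n)

# least-squares fit of a linear combination of the parameters plus intercept
Xb = np.column_stack([X, np.ones(n)])
coef, *_ = np.linalg.lstsq(Xb, y, rcond=None)
pred = Xb @ coef

# multiple correlation coefficient and RMS error of the fitted ratings
r = np.corrcoef(pred, y)[0, 1]
rmse = np.sqrt(np.mean((pred - y) ** 2))
print(f"R = {r:.2f}, RMSE = {rmse:.2f}")
```

Here R is the correlation between predicted and observed ratings, so R > 0.7 with RMS error around 0.6 rating units, as in the abstract, means the linear model captures most of the rated variation for those dimensions.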