Constructing a spoken dialogue corpus for studying paralinguistic information in expressive conversation and analyzing its statistical/acoustic characteristics

  • Authors:
  • Hiroki Mori; Tomoyuki Satake; Makoto Nakamura; Hideki Kasuya

  • Affiliations:
  • Graduate School of Engineering, Utsunomiya University, 7-1-2, Yoto, Utsunomiya-shi 321-8585, Japan (H. Mori, T. Satake, H. Kasuya); Faculty of International Studies, Utsunomiya University, 350, Minemachi, Utsunomiya-shi 321-8505, Japan (M. Nakamura)

  • Venue:
  • Speech Communication
  • Year:
  • 2011

Abstract

The Utsunomiya University (UU) Spoken Dialogue Database for Paralinguistic Information Studies is introduced. The UU Database is intended especially for studying the usage, structure, and effect of paralinguistic information in expressive Japanese conversational speech. Paralinguistic information refers to meaningful information, such as emotion or attitude, delivered along with the linguistic message. The UU Database comes with labels of perceived emotional states for all utterances. The emotional states were annotated along six abstract dimensions: pleasant-unpleasant, aroused-sleepy, dominant-submissive, credible-doubtful, interested-indifferent, and positive-negative. To stimulate expressively rich and vivid conversation, the "4-frame cartoon sorting task" was devised: four cards, each showing one frame of a four-frame cartoon, are shuffled and dealt, two to each participant, and the participants must then infer the original order of the frames. The effectiveness of the method was supported by a broad distribution of subjective emotional state ratings. Preliminary annotation experiments with a large number of annotators confirmed that most annotators could provide fairly consistent ratings for a repeated identical stimulus, and that inter-rater agreement was good (W ≈ 0.5) for three of the six dimensions. Based on these results, three annotators were selected to label all 4840 utterances; their high degree of agreement was verified using measures such as Kendall's W. Correlation analyses showed that not only prosodic parameters such as intensity and f0 but also a voice quality parameter were related to the dimensions. A multiple correlation above 0.7 and an RMS error of about 0.6 were obtained for the recognition of some dimensions using linear combinations of the speech parameters. Overall, the perceived emotional state of a speaker can be estimated accurately from the speech parameters in most cases.
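
As a rough illustration of the agreement measure cited above, the following Python sketch computes Kendall's coefficient of concordance W for a small set of annotator ratings. The formula is the standard one without tie correction, and the ratings shown are invented for the example, not taken from the UU Database.

```python
import numpy as np
from scipy.stats import rankdata

def kendalls_w(ratings):
    """Kendall's coefficient of concordance W for an (m raters x n items) array.

    W ranges from 0 (no agreement) to 1 (perfect agreement); ties get
    average ranks, and no tie correction is applied in this sketch.
    """
    ratings = np.asarray(ratings, dtype=float)
    m, n = ratings.shape
    ranks = np.apply_along_axis(rankdata, 1, ratings)   # rank items per rater
    rank_sums = ranks.sum(axis=0)                       # R_j for each item
    s = ((rank_sums - rank_sums.mean()) ** 2).sum()     # spread of rank sums
    return 12.0 * s / (m ** 2 * (n ** 3 - n))

# Invented ratings: 3 annotators scoring 6 utterances on one dimension.
ratings = [[1, 2, 3, 4, 5, 6],
           [2, 1, 3, 5, 4, 6],
           [1, 3, 2, 4, 6, 5]]
print(f"W = {kendalls_w(ratings):.2f}")  # ~0.87: strong agreement
```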
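Likewise, the dimension-recognition figures (multiple correlation above 0.7, RMS error of about 0.6) come from fitting linear combinations of speech parameters to the perceived ratings. The sketch below reproduces that kind of fit by ordinary least squares on synthetic stand-in data; the four feature columns (e.g., mean intensity, mean f0, f0 range, a voice quality measure) are assumptions for illustration, not the paper's actual parameter set.

```python
import numpy as np

# Synthetic stand-in data: 200 utterances x 4 speech parameters
# (imagine mean intensity, mean f0, f0 range, a voice quality measure).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = X @ np.array([0.8, 0.5, 0.3, 0.2]) + rng.normal(scale=0.6, size=200)

# Fit a linear combination of the parameters by ordinary least squares.
A = np.column_stack([X, np.ones(len(X))])     # append intercept column
coef, *_ = np.linalg.lstsq(A, y, rcond=None)
pred = A @ coef

# Multiple correlation R and RMS error, the two figures the abstract cites.
r = np.corrcoef(pred, y)[0, 1]
rmse = np.sqrt(np.mean((pred - y) ** 2))
print(f"multiple correlation R = {r:.2f}, RMS error = {rmse:.2f}")
```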