Predicting Speaker Head Nods and the Effects of Affective Information

  • Authors:
  • Jina Lee; S. C. Marsella

  • Affiliations:
  • Dept. of Computer Science, University of Southern California, Los Angeles, CA, USA

  • Venue:
  • IEEE Transactions on Multimedia
  • Year:
  • 2010

Abstract

During face-to-face conversation, our body is continually in motion, displaying various head, gesture, and posture movements. Based on findings describing the communicative functions served by these nonverbal behaviors, many virtual agent systems have modeled them to make virtual agents more effective and believable. One channel of nonverbal behavior that has received less attention is head movement, despite the important functions it serves. The goal of this work is to build a domain-independent model of speakers' head movements that can be used to generate head movements for virtual agents. In this paper, we present a machine learning approach for learning models of head movements, focusing on when speaker head nods should occur, and conduct evaluation studies that compare the nods generated by this work to our previous approach of using handcrafted rules. To learn patterns of speaker head nods, we use a gesture corpus and rely on the linguistic and affective features of the utterance. We describe the feature selection and training processes for learning hidden Markov models and compare the results of the learned models under varying conditions. The results show that we can predict speaker head nods with high precision (.84) and recall (.89), even without a deep representation of the surface text, and that using affective information can help improve the prediction of head nods (precision: .89, recall: .90). The evaluation study shows that the nods generated by the machine learning approach are perceived as more natural in terms of nod timing than the nods generated by the rule-based approach.
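The abstract describes training hidden Markov models on linguistic and affective features to decide when a speaker head nod should occur. The sketch below is purely illustrative and is not the authors' implementation: it assumes each utterance segment has already been converted into a sequence of numeric feature vectors (hypothetical inputs), and it uses the hmmlearn library to fit one Gaussian HMM per class (nod vs. no-nod) and label a new segment by comparing log-likelihoods.

```python
# Minimal sketch, assuming per-segment feature sequences are already extracted.
# Not the paper's model; just one common HMM-based classification setup.
import numpy as np
from hmmlearn import hmm

def train_class_hmm(sequences, n_states=3, seed=0):
    """Fit a Gaussian HMM on a list of (n_frames, n_features) arrays for one class."""
    X = np.concatenate(sequences)          # stack all frames
    lengths = [len(s) for s in sequences]  # frame count per sequence
    model = hmm.GaussianHMM(n_components=n_states, covariance_type="diag",
                            n_iter=100, random_state=seed)
    model.fit(X, lengths)
    return model

def classify_segment(segment, nod_hmm, no_nod_hmm):
    """Label a segment 'nod' if the nod-class HMM explains it better."""
    return "nod" if nod_hmm.score(segment) > no_nod_hmm.score(segment) else "no_nod"

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Synthetic stand-ins for linguistic/affective feature sequences.
    nod_train = [rng.normal(1.0, 0.5, size=(rng.integers(5, 12), 4)) for _ in range(30)]
    no_nod_train = [rng.normal(-1.0, 0.5, size=(rng.integers(5, 12), 4)) for _ in range(30)]

    nod_hmm = train_class_hmm(nod_train)
    no_nod_hmm = train_class_hmm(no_nod_train)

    test_segment = rng.normal(1.0, 0.5, size=(8, 4))
    print(classify_segment(test_segment, nod_hmm, no_nod_hmm))
```

In this toy setup, adding affective features would simply widen the feature vectors; the paper's reported gains (precision .84 to .89) come from its own feature set and training procedure, not from this sketch.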