Analysis of Head Gesture and Prosody Patterns for Prosody-Driven Head-Gesture Animation

Authors:
Mehmet E. Sargin;Yucel Yemez;Engin Erzin;Ahmet M. Tekalp
Venue:
IEEE Transactions on Pattern Analysis and Machine Intelligence
Year:
2008

Citing 0
Cited 7

Improving throat microphone speech recognition by joint analysis of throat and acoustic microphone recordings
IEEE Transactions on Audio, Speech, and Language Processing

Gesture controllers
ACM SIGGRAPH 2010 papers

On the importance of eye gaze in a face-to-face collaborative task
Proceedings of the 3rd international workshop on Affective interaction in natural environments

Speech, gaze and head motion in a face-to-face collaborative task
Proceedings of the Third COST 2102 international training school conference on Toward autonomous, adaptive, and context-aware multimodal interfaces: theoretical and practical issues

How to train your avatar: a data driven approach to gesture generation
IVA'11 Proceedings of the 10th international conference on Intelligent virtual agents

Non-rigid 3D shape tracking from multiview video
Computer Vision and Image Understanding

Guest Editorial: Gesture and speech in interaction: An overview
Speech Communication

Quantified Score

Hi-index	0.14

Abstract

We propose a new two-stage framework for the joint analysis of a speaker's head gesture and speech prosody patterns, aimed at automatic, realistic synthesis of head gestures from speech prosody. In the first stage, we perform unsupervised temporal segmentation of head gesture and speech prosody features separately, using Hidden Markov Models (HMMs), to determine the elementary head gesture and speech prosody patterns of a particular speaker. In the second stage, we analyze the correlations between these elementary gesture and prosody patterns jointly, using multi-stream HMMs, to determine an audio-visual mapping model. The resulting model is then used to synthesize natural head gestures from arbitrary input test speech, given a head model for the speaker. In the synthesis stage, the audio-visual mapping model predicts a sequence of gesture patterns from the prosody pattern sequence computed for the input test speech. The Euler angles associated with each gesture pattern are then applied to animate the speaker's head model. Objective and subjective evaluations indicate that the proposed synthesis-by-analysis scheme produces natural-looking head gestures for the speaker with any input test speech, as well as in "prosody transplant" and "gesture transplant" scenarios.
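
The synthesis step described in the abstract, predicting a gesture pattern sequence from a prosody pattern sequence, can be illustrated with a much simplified sketch. The code below is not the authors' multi-stream HMM. It assumes a single discrete HMM in which hidden states stand for gesture patterns and observations stand for prosody pattern labels, and it uses Viterbi decoding to find the most likely gesture sequence. The function name `viterbi`, the two-pattern vocabularies, and all probability values are hypothetical and chosen only for illustration.

```python
import math


def viterbi(obs, start_p, trans_p, emit_p):
    """Most likely hidden-state path of a discrete HMM (log domain).

    Assumes every probability is strictly positive so that log() is defined.
    """
    n_states = len(start_p)
    # Best log-score of any path ending in each state at the first frame.
    score = [math.log(start_p[s]) + math.log(emit_p[s][obs[0]])
             for s in range(n_states)]
    back = []  # back[t][s] = best predecessor of state s at frame t + 1
    for o in obs[1:]:
        prev, score, ptr = score, [], []
        for s in range(n_states):
            best = max(range(n_states),
                       key=lambda r: prev[r] + math.log(trans_p[r][s]))
            ptr.append(best)
            score.append(prev[best] + math.log(trans_p[best][s])
                         + math.log(emit_p[s][o]))
        back.append(ptr)
    # Trace the best path backwards from the best final state.
    state = max(range(n_states), key=lambda s: score[s])
    path = [state]
    for ptr in reversed(back):
        state = ptr[state]
        path.append(state)
    path.reverse()
    return path


# Hypothetical toy model: gesture patterns 0 = "still", 1 = "nod";
# prosody patterns 0 = "flat", 1 = "accented". Values are made up.
START = [0.5, 0.5]
TRANS = [[0.8, 0.2],   # gesture patterns tend to persist
         [0.2, 0.8]]
EMIT = [[0.9, 0.1],    # "still" mostly co-occurs with "flat" prosody
        [0.1, 0.9]]    # "nod" mostly co-occurs with "accented" prosody

prosody_obs = [0, 0, 1, 1, 0]
gesture_path = viterbi(prosody_obs, START, TRANS, EMIT)
print(gesture_path)
```

In a fuller pipeline along the lines of the abstract, each decoded gesture pattern would then be expanded into its associated Euler-angle trajectory to drive the head model. That step is omitted here because the abstract does not specify how the angles are stored or interpolated.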