In this paper, we formulate the problem of synthesizing facial animation from an input audio sequence as a dynamic audio-visual mapping. We propose modeling this mapping with an input-output hidden Markov model (IOHMM): an HMM whose output and transition probabilities are conditioned on the input sequence. We train IOHMMs with the expectation-maximization (EM) algorithm, using a novel architecture in which neural networks explicitly model the dependence of the transition probabilities on the input. Given an input sequence, the output sequence is synthesized by maximum likelihood estimation. Experimental results demonstrate that IOHMMs generate natural, high-quality facial animation sequences from the input audio.
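To make the idea concrete, the following is a minimal sketch of an IOHMM's synthesis step, not the paper's actual system. All parameters (`W`, `b`, `mu`), the number of states, and the dimensions are hypothetical and randomly initialized rather than EM-trained; the transition probabilities are produced by a single-layer softmax network conditioned on the input, and the output is read off the most likely state path (a Viterbi simplification of maximum likelihood estimation):

```python
import numpy as np

rng = np.random.default_rng(0)

K, D_IN, D_OUT, T = 3, 2, 1, 10  # states, input dim, output dim, length

# Hypothetical parameters; in the paper these would be learned with EM,
# with neural networks modeling the input-dependent transitions.
W = rng.normal(size=(K, K, D_IN))   # per-source-state transition weights
b = rng.normal(size=(K, K))         # transition biases
mu = rng.normal(size=(K, D_OUT))    # per-state output means

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def transition_matrix(x):
    """Input-conditioned transitions: row k is P(s_t = . | s_{t-1} = k, x_t)."""
    return softmax(W @ x + b)  # shape (K, K), rows sum to 1

def map_state_path(xs):
    """Most likely state path under input-dependent transitions
    (uniform initial state distribution, for simplicity)."""
    n = len(xs)
    logp = np.full(K, -np.log(K))
    back = np.zeros((n, K), dtype=int)
    for t, x in enumerate(xs):
        scores = logp[:, None] + np.log(transition_matrix(x))  # (from, to)
        back[t] = scores.argmax(axis=0)
        logp = scores.max(axis=0)
    path = [int(logp.argmax())]
    for t in range(n - 1, 0, -1):
        path.append(int(back[t][path[-1]]))
    return path[::-1]

def synthesize(xs):
    """Output sequence as the per-state output means along the MAP path."""
    return np.stack([mu[s] for s in map_state_path(xs)])

xs = rng.normal(size=(T, D_IN))  # stand-in for an audio feature sequence
ys = synthesize(xs)              # stand-in for a facial-parameter sequence
print(ys.shape)                  # (10, 1)
```

In a full IOHMM the emission distributions would also be conditioned on the input and the output would marginalize over state sequences; the sketch above keeps only the defining property that the transition matrix is recomputed from each input frame.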