Audio/visual mapping with cross-modal hidden Markov models

  • Authors:
  • Shengli Fu; R. Gutierrez-Osuna; A. Esposito; P. K. Kakumanu; O. N. Garcia

  • Affiliations:
  • Dept. of Electr. & Comput. Eng., Univ. of Delaware, Newark, DE, USA

  • Venue:
  • IEEE Transactions on Multimedia
  • Year:
  • 2005


Abstract

The audio/visual mapping problem of speech-driven facial animation has intrigued researchers for years. Recent research efforts have demonstrated that hidden Markov model (HMM) techniques, which have been applied successfully to speech recognition, can achieve a similar level of success in audio/visual mapping problems. A number of HMM-based methods have been proposed and shown to be effective by their respective designers, but it remains unclear how these techniques compare on a common test bed. In this paper, we quantitatively compare three recently proposed cross-modal HMM methods: the remapping HMM (R-HMM), the least-mean-squared HMM (LMS-HMM), and HMM inversion (HMMI). The objective of our comparison is not only to highlight the strengths and weaknesses of different mapping designs, but also to study the optimality of the acoustic representation and HMM structure for the purpose of speech-driven facial animation. This paper presents a brief overview of these models, followed by an analysis of their mapping capabilities on a synthetic dataset. Finally, an empirical comparison is presented on an experimental audio-visual dataset consisting of 75 TIMIT sentences. Our results show that HMMI provides the best performance on both synthetic and experimental audio-visual data.
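
To make the general cross-modal mapping idea concrete, the sketch below trains an HMM on the acoustic stream and maps each decoded state to the mean visual feature of its aligned training frames. This is a minimal illustration only, assuming the hmmlearn library; it is not an implementation of the paper's R-HMM, LMS-HMM, or HMMI methods, and the function names, state count, and feature dimensions are illustrative assumptions.

```python
# Minimal sketch of cross-modal HMM mapping for speech-driven facial animation.
# Assumptions: hmmlearn is available; audio and visual features are frame-aligned.
import numpy as np
from hmmlearn.hmm import GaussianHMM  # assumed dependency

def train_cross_modal_hmm(audio_train, visual_train, n_states=8):
    """audio_train: (T, d_a) acoustic features; visual_train: (T, d_v) visual features."""
    hmm = GaussianHMM(n_components=n_states, covariance_type="diag", n_iter=50)
    hmm.fit(audio_train)               # learn the HMM on the acoustic stream only
    states = hmm.predict(audio_train)  # Viterbi state sequence for the training audio
    # Associate each HMM state with the mean visual feature of its aligned frames
    state_to_visual = np.vstack([
        visual_train[states == s].mean(axis=0) if np.any(states == s)
        else np.zeros(visual_train.shape[1])
        for s in range(n_states)
    ])
    return hmm, state_to_visual

def audio_to_visual(hmm, state_to_visual, audio_test):
    """Predict a visual trajectory for new audio by decoding states and looking up their visual means."""
    states = hmm.predict(audio_test)
    return state_to_visual[states]
```

In this simplified scheme the visual output is piecewise constant per state; the methods compared in the paper differ precisely in how they refine this mapping, e.g., by remapping audio-trained states to visual observations or by inverting the HMM to estimate the visual sequence directly.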