A coupled HMM approach to video-realistic speech animation
Pattern Recognition
The audio/visual mapping problem of speech-driven facial animation has intrigued researchers for years. Recent research efforts have demonstrated that hidden Markov model (HMM) techniques, which have been applied successfully to the problem of speech recognition, can achieve a similar level of success in audio/visual mapping problems. A number of HMM-based methods have been proposed and shown to be effective by their respective designers, but it remains unclear how these techniques compare to one another on a common test bed. In this paper, we quantitatively compare three recently proposed cross-modal HMM methods, namely the remapping HMM (R-HMM), the least-mean-squared HMM (LMS-HMM), and HMM inversion (HMMI). The objective of our comparison is not only to highlight the merits and drawbacks of the different mapping designs, but also to study the optimality of the acoustic representation and HMM structure for the purpose of speech-driven facial animation. This paper presents a brief overview of these models, followed by an analysis of their mapping capabilities on a synthetic dataset. An empirical comparison on an experimental audio-visual dataset consisting of 75 TIMIT sentences is finally presented. Our results show that HMMI provides the best performance, on both synthetic and experimental audio-visual data.