Mapping from speech to images using continuous state space models

Authors:
Tue Lehn-Schiøler;Lars Kai Hansen;Jan Larsen
Affiliations:
Informatics and Mathematical Modelling, The Technical University of Denmark;Informatics and Mathematical Modelling, The Technical University of Denmark;Informatics and Mathematical Modelling, The Technical University of Denmark
Venue:
MLMI'04 Proceedings of the First international conference on Machine Learning for Multimodal Interaction
Year:
2004

Abstract

This paper proposes a system that transforms speech waveforms into animated faces. The system relies on continuous state space models to perform the mapping, which makes it possible to produce video with no sudden jumps and allows continuous control of the parameters in 'face space'. The performance of the system depends critically on the number of hidden variables: with too few, the model cannot represent the data, and with too many, overfitting is observed. Simulations are performed on recordings of 3-5 second video sequences with sentences from the TIMIT database. From a subjective point of view, the model is able to construct an image sequence from an unknown noisy speech sequence even though the number of training examples is limited.
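
The abstract does not give the model equations, so the following is only a minimal sketch of how a continuous state space mapping of this kind could work, assuming a linear-Gaussian form: a hidden state evolves smoothly over time, speech features are one noisy linear observation of it, and face parameters are another. At test time a Kalman filter infers the hidden state from speech alone, and the face parameters are read out from that state. The matrices `A`, `B`, `C`, the noise covariances `Q`, `R`, the function names, and the synthetic data are all hypothetical placeholders, not the authors' implementation or trained values.

```python
import numpy as np


def kalman_filter(speech, A, B, Q, R, x0, P0):
    """Infer hidden states from speech features (assumed linear-Gaussian model).

    speech: array of shape (T, n_speech), one feature vector per frame.
    Returns the filtered state means, shape (T, n_hidden).
    """
    x, P = x0, P0
    identity = np.eye(len(x0))
    states = []
    for s in speech:
        # Predict: propagate the state through the linear dynamics.
        x = A @ x
        P = A @ P @ A.T + Q
        # Update: correct the prediction with the observed speech frame.
        S = B @ P @ B.T + R
        K = P @ B.T @ np.linalg.inv(S)
        x = x + K @ (s - B @ x)
        P = (identity - K @ B) @ P
        states.append(x)
    return np.array(states)


def speech_to_face(speech, A, B, C, Q, R, x0, P0):
    """Map speech features to face parameters via the shared hidden state."""
    states = kalman_filter(speech, A, B, Q, R, x0, P0)
    # Each face-parameter vector is a linear readout C @ x of the state.
    return states @ C.T


# Synthetic demonstration with made-up dimensions and random matrices.
rng = np.random.default_rng(0)
n_hidden, n_speech, n_face, T = 3, 5, 4, 50
A = 0.95 * np.eye(n_hidden)                  # slowly decaying dynamics
B = rng.standard_normal((n_speech, n_hidden))  # state -> speech features
C = rng.standard_normal((n_face, n_hidden))    # state -> face parameters
Q = 0.01 * np.eye(n_hidden)                  # process noise covariance
R = 0.1 * np.eye(n_speech)                   # speech observation noise
speech = rng.standard_normal((T, n_speech))  # stand-in for real features

face = speech_to_face(speech, A, B, C, Q, R,
                      x0=np.zeros(n_hidden), P0=np.eye(n_hidden))
print(face.shape)
```

In this sketch `n_hidden` plays the role of the number of hidden variables discussed in the abstract: it fixes the dimension of the state through which all information from speech to face must pass.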