In this paper we present an algorithm for automatically cloning a speaking human face for virtual reality applications and the construction of virtual worlds. A person trains the algorithm by speaking: the algorithm learns the articulatory movements and generates synthetic speech that sounds similar to the speech of the person who trained the system. By modeling the nonlinear mapping between articulatory movements and facial movements with a neural network, the algorithm generates facial movements, synchronized with the artificial utterance, that a human speaker would have produced while uttering it. Our algorithm is inspired by the mirror-neuron theory of speech production and learns the articulatory movements using a genetic optimization algorithm and a set of fuzzy rules. The algorithm reproduces an original utterance by minimizing the mean squared error between the synthetic and original utterances. In subjective listening tests, sentences generated with our model achieved an average phonetic accuracy of about 84%, and the naturalness of the generated face movements was rated at 82%. Experimental results and a case study are reported.
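The core optimization loop described above, a genetic search over articulatory parameters that minimizes the mean squared error between the synthetic and original utterances, can be sketched in Python. This is a minimal illustration only: the `synthesize` function below is a toy stand-in for the paper's articulatory synthesizer, and the population size, mutation noise, and parameter dimension are illustrative assumptions, not values from the paper.

```python
import random

def synthesize(params, n=64):
    # Toy stand-in for the articulatory speech synthesizer: maps a small
    # parameter vector to a "waveform" via a polynomial in normalized time.
    return [sum(p * ((t / n) ** i) for i, p in enumerate(params)) for t in range(n)]

def mse(a, b):
    # Mean squared error between two equal-length signals.
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

def evolve(target, dim=3, pop_size=30, generations=200, seed=0):
    # Genetic search: keep the fitter half of the population, then refill it
    # with averaged-and-mutated children of randomly paired survivors.
    rng = random.Random(seed)
    pop = [[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=lambda ind: mse(synthesize(ind), target))
        elite = pop[: pop_size // 2]
        children = []
        while len(elite) + len(children) < pop_size:
            a, b = rng.sample(elite, 2)
            children.append([(x + y) / 2 + rng.gauss(0, 0.05)
                             for x, y in zip(a, b)])
        pop = elite + children
    return min(pop, key=lambda ind: mse(synthesize(ind), target))

# Recover hidden articulatory parameters from their synthesized "utterance".
true_params = [0.5, -0.2, 0.8]
target = synthesize(true_params)
best = evolve(target)
```

After a few hundred generations the best individual's synthetic signal should match the target far more closely than an untrained (all-zero) parameter vector; the paper's system performs the analogous search over real articulatory trajectories.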