We present a technique for accurate automatic visible speech synthesis from textual input. When provided with a speech waveform and the text of a spoken sentence, the system produces accurate visible speech synchronized with the audio signal. To develop the system, we collected motion capture data from a speaker's face during production of a set of words containing all diviseme sequences in English. The motion capture points on the speaker's face are retargeted to the vertices of a polygonal 3D face model. When synthesizing a new utterance, the system locates the required sequence of divisemes, shrinks or expands each diviseme to match the desired phoneme segment durations in the target utterance, and then moves the vertices in the lip and lower-face regions to match the spatial coordinates of the motion capture data. The motion mapping is realized by a key-shape mapping function learned from a set of viseme examples on the source and target faces, and a well-posed numerical algorithm estimates the shape-blending coefficients. Time warping and motion-vector blending at the junctures of adjacent divisemes, together with an algorithm that searches for the optimal concatenation of visible speech, are also developed to produce the final concatenated motion sequence. Copyright © 2004 John Wiley & Sons, Ltd.
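The abstract does not specify the numerical algorithm used to estimate the shape-blending coefficients, so the following is only a minimal sketch of the general idea: express a captured frame as a weighted combination of the source face's viseme key shapes via least squares, then apply the same weights to the target face's key shapes. The function names and array layouts are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def estimate_blend_weights(frame, key_shapes):
    """Estimate shape-blending coefficients for one mocap frame.

    frame:      (3N,) flattened marker positions of the source face.
    key_shapes: (K, 3N) flattened viseme key shapes of the source face.
    Returns the (K,) weight vector w minimizing ||key_shapes.T @ w - frame||_2.
    (The paper's "well-posed" algorithm may add constraints, e.g. w >= 0;
    plain least squares is used here for simplicity.)
    """
    w, *_ = np.linalg.lstsq(key_shapes.T, frame, rcond=None)
    return w

def retarget_frame(weights, target_key_shapes):
    """Apply source-face weights to the target face's key shapes.

    target_key_shapes: (K, 3M) flattened key shapes of the target model.
    Returns the (3M,) retargeted vertex positions.
    """
    return target_key_shapes.T @ weights
```

In this cross-mapping scheme the coefficients act as a face-independent description of the mouth shape, which is what allows motion captured on one face to drive a different 3D model.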
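Shrinking or expanding a diviseme to the target phoneme durations can be sketched as a uniform time warp that resamples the motion clip; the abstract does not describe the warp in detail, so this linear-interpolation version (and the assumed frame rate parameter) is only one plausible realization.

```python
import numpy as np

def time_warp(motion, src_duration, tgt_duration, fps=120.0):
    """Uniformly stretch or shrink a diviseme motion clip.

    motion:       (T, D) array, one row of D motion-capture coordinates per frame.
    src_duration: duration in seconds of the recorded diviseme.
    tgt_duration: desired duration from the target utterance's phoneme segments.
    Returns a resampled clip of round(tgt_duration * fps) frames.
    """
    n_frames = motion.shape[0]
    src_times = np.linspace(0.0, src_duration, n_frames)
    n_out = max(2, int(round(tgt_duration * fps)))
    # Sample target frames at evenly spaced positions within the source clip.
    sample_times = np.linspace(0.0, src_duration, n_out)
    return np.stack(
        [np.interp(sample_times, src_times, motion[:, d]) for d in range(motion.shape[1])],
        axis=1,
    )
```

A nonuniform warp (e.g. aligning each phoneme boundary separately) would follow the same pattern with a piecewise mapping from target time to source time.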
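The motion-vector blending at the juncture of two divisemes can be illustrated with a simple linear cross-fade over an overlap region; the actual blending weights and overlap length used by the system are not given in the abstract, so both are assumptions here.

```python
import numpy as np

def blend_junction(clip_a, clip_b, overlap):
    """Concatenate two diviseme clips, cross-fading the shared overlap.

    clip_a, clip_b: (Ta, D) and (Tb, D) motion clips.
    overlap:        number of frames to blend at the juncture.
    Returns a (Ta + Tb - overlap, D) motion sequence.
    """
    # Linear weights: fully clip_a at the start of the overlap, fully clip_b at its end.
    alpha = np.linspace(0.0, 1.0, overlap)[:, None]
    blended = (1.0 - alpha) * clip_a[-overlap:] + alpha * clip_b[:overlap]
    return np.concatenate([clip_a[:-overlap], blended, clip_b[overlap:]], axis=0)
```

Blending after time warping keeps each juncture smooth while preserving the phoneme timing of the target utterance.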