Multimodal translation system using texture-mapped lip-sync images for video mail and automatic dubbing applications

  • Authors:
  • Shigeo Morishima; Satoshi Nakamura

  • Affiliations:
  • Shigeo Morishima: School of Science and Engineering, Waseda University, Tokyo, Japan, and ATR Spoken Language Translation Research Laboratories, Kyoto, Japan
  • Satoshi Nakamura: ATR Spoken Language Translation Research Laboratories, Kyoto, Japan

  • Venue:
  • EURASIP Journal on Applied Signal Processing
  • Year:
  • 2004


Abstract

We introduce a multimodal English-to-Japanese and Japanese-to-English translation system that also translates the speaker's speech motion by synchronizing it to the translated speech. The system combines a face synthesis technique that can generate the lip shape of any viseme with a face tracking technique that estimates the position and rotation of the speaker's face in an image sequence. To retain the speaker's facial expression, we substitute only the image of the speech organs with a synthesized one, generated from a 3D wire-frame model that can be adapted to any speaker. This approach provides translated image synthesis with an extremely small database. Tracking of the facial motion in the video is performed by template matching: the translation and rotation of the face are detected using a 3D personal face model whose texture is captured from a video frame. We also propose a method for customizing the personal face model with our GUI tool. By combining these techniques with translated voice synthesis, automatic multimodal translation can be achieved that is suitable for video mail or for automatically dubbing video into other languages.
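The abstract's tracking step, detecting the face's translation and rotation by template matching, can be illustrated with a minimal sketch. The version below is a simplification: it tracks only 2D in-plane translation with OpenCV's normalized cross-correlation, whereas the paper searches over 3D pose using a textured 3D personal face model. The file name, region coordinates, and function names are illustrative assumptions, not details from the paper.

```python
# Minimal sketch of template-matching face tracking (2D translation only).
# The paper's method recovers rotation as well, by matching renderings of a
# textured 3D personal face model; this sketch shows only the matching idea.
import cv2

def track_face(frames, template):
    """Locate the face template in each frame via normalized cross-correlation."""
    positions = []
    for frame in frames:
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        # TM_CCOEFF_NORMED is robust to uniform brightness changes.
        score = cv2.matchTemplate(gray, template, cv2.TM_CCOEFF_NORMED)
        _, _, _, max_loc = cv2.minMaxLoc(score)  # best-match top-left corner
        positions.append(max_loc)
    return positions

# Usage: cut the template from the first frame (e.g., a hand-marked face box),
# then track it through the rest of the sequence.
cap = cv2.VideoCapture("speaker.avi")  # hypothetical input video
ok, first = cap.read()
template = cv2.cvtColor(first, cv2.COLOR_BGR2GRAY)[100:250, 150:300]  # assumed face region

frames = []
while True:
    ok, frame = cap.read()
    if not ok:
        break
    frames.append(frame)

print(track_face(frames, template))
```

Extending this to the paper's setting would mean replacing the fixed 2D template with renderings of the 3D face model at candidate rotations and choosing the pose with the highest match score.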