Design, implementation and evaluation of the Czech realistic audio-visual speech synthesis

  • Authors:
  • Miloš Železný; Zdeněk Krňoul; Petr Císař; Jindřich Matoušek

  • Affiliations:
  • University of West Bohemia in Pilsen, Plzeň, Czech Republic (all authors)

  • Venue:
  • Signal Processing - Special section: Multimodal human-computer interfaces
  • Year:
  • 2006


Abstract

This paper presents the whole process of creating an audio-visual speech synthesis system. Such a system consists of two main parts: acoustic synthesis, which emulates human speech, and facial animation, which emulates human lip articulation. The acoustic subsystem is based on concatenative speech synthesis. The visual subsystem is designed as a realistic, fully three-dimensional, parametrically controllable facial animation model. To control the animation parametrically so that it emulates human articulation, a set of visual parameters has to be obtained for all basic speech units. To provide realistic animation, a database of lip movements of a real person needs to be recorded and expressed by a suitable parameterization; the set of control parameters for the visual animation is then derived from this database. Basing the 3D head model on the head of a real person also makes the animation more realistic, and obtaining such a model requires 3D scanning of a real person. We present the design and implementation of this whole process. The aim is to obtain realistic audio-visual speech synthesis with the possibility of adapting the 3D head model to a particular person. The design, acquisition and processing of an audio-visual speech corpus for this purpose are presented. Next, the process of both acoustic and visual speech synthesis is described. The visual speech synthesis comprises the tasks of model training, animation control and co-articulation modelling. Facial animation can also increase the intelligibility of telephone speech for people with hearing disabilities; in that case, the textual information needed to control the animation is not available. A solution to the problem of mapping visual parameters from the speech signal, either directly or through recognized text, is presented. Furthermore, the 3D scanning algorithm is presented; it allows a realistic 3D model to be obtained from the head of a real person and thus the talking head to be personalized. At the end of the paper, an evaluation of the intelligibility of the presented audio-visual speech synthesis is given and its possible applications are discussed.
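The abstract names co-articulation modelling as one of the visual synthesis tasks but does not give its formulation. The following is a minimal sketch of one widely used approach, the Cohen-Massaro dominance-function model, not necessarily the one adopted in the paper; the parameter names and all numeric values (`magnitude`, `rate`, the target times) are hypothetical.

```python
import math
from dataclasses import dataclass

@dataclass
class VisemeTarget:
    time: float        # centre of the speech segment (s)
    value: float       # target value of one animation parameter (e.g. lip opening)
    magnitude: float   # dominance magnitude (alpha)
    rate: float        # dominance decay rate (theta)

def dominance(target: VisemeTarget, t: float, c: float = 1.0) -> float:
    """Exponential dominance D(t) = alpha * exp(-theta * |t - t_i|^c)."""
    return target.magnitude * math.exp(-target.rate * abs(t - target.time) ** c)

def blended_parameter(targets: list[VisemeTarget], t: float) -> float:
    """Dominance-weighted average of viseme targets at time t."""
    weights = [dominance(tg, t) for tg in targets]
    total = sum(weights)
    if total == 0.0:
        return 0.0
    return sum(w * tg.value for w, tg in zip(weights, targets)) / total

# Three consecutive visemes with hypothetical targets for "lip opening":
targets = [
    VisemeTarget(time=0.10, value=0.8, magnitude=1.0, rate=12.0),  # open vowel
    VisemeTarget(time=0.25, value=0.1, magnitude=1.2, rate=15.0),  # bilabial closure
    VisemeTarget(time=0.40, value=0.6, magnitude=1.0, rate=12.0),  # mid vowel
]
for ms in range(0, 500, 50):
    t = ms / 1000.0
    print(f"t={t:.2f}s  lip_opening={blended_parameter(targets, t):.3f}")
```

Each viseme exerts an exponentially decaying influence around its segment centre, so neighbouring segments pull a parameter's trajectory toward their own targets, which reproduces anticipatory and carry-over co-articulation.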
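For the telephone-speech scenario, where no text is available and visual parameters must be derived from the speech signal itself, one minimal illustration of the direct route is a frame-by-frame regression from acoustic features to visual animation parameters. The sketch below uses synthetic stand-in data and a plain least-squares linear map; the paper's actual mapping method, feature set and dimensionalities are not specified in the abstract, so everything here is an assumption for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical training data: acoustic feature frames (e.g. 13 MFCCs per
# 10 ms frame) paired with measured visual parameters (e.g. lip width,
# lip opening, protrusion) from an audio-visual corpus.
n_frames, n_mfcc, n_visual = 2000, 13, 3
X = rng.standard_normal((n_frames, n_mfcc))          # acoustic features
true_map = rng.standard_normal((n_mfcc, n_visual))   # stand-in for real data
Y = X @ true_map + 0.05 * rng.standard_normal((n_frames, n_visual))

# Fit a linear map W (with a bias column) by least squares: Y ~ [X, 1] @ W.
X_aug = np.hstack([X, np.ones((n_frames, 1))])
W, *_ = np.linalg.lstsq(X_aug, Y, rcond=None)

# At synthesis time, each incoming acoustic frame is converted to visual
# animation parameters frame by frame, with no text in the loop.
frame = rng.standard_normal(n_mfcc)
visual_params = np.append(frame, 1.0) @ W
print("predicted visual parameters:", np.round(visual_params, 3))
```

The alternative route mentioned in the abstract, going through recognized text, would instead run a speech recognizer and drive the animation from the recognized phone sequence, at the cost of recognition latency and errors.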