Face Analysis for the Synthesis of Photo-Realistic Talking Heads

  • Authors:
  • Hans Peter Graf; Eric Cosatto; Tony Ezzat

  • Venue:
  • FG '00: Proceedings of the Fourth IEEE International Conference on Automatic Face and Gesture Recognition 2000
  • Year:
  • 2000

Abstract

This paper describes techniques for extracting bitmaps of facial parts from videos of a talking person. The goal is to synthesize photo-realistic talking heads of high quality that show picture-perfect appearance and realistic head movements with good lip-sound synchronization. For the synthesis of a talking head, bitmaps of facial parts are combined into whole heads, and sequences of such images are integrated with audio from a text-to-speech synthesizer. For a seamless integration of facial parts, their shape and visual appearance must be known with high accuracy. When a person is recorded for such a task, the head moves and the facial expressions change, influencing the appearance of the face. The recognition system therefore has to find not only the locations of facial features, but must also determine the head's orientation and estimate the facial expressions.

Our face recognition proceeds in multiple steps, each with increased precision. Using motion, color, and shape information, the head's position and the locations of the main facial features are determined first. Then smaller areas are searched with matched filters in order to identify specific facial features with high precision. From this information the head's 3D orientation is calculated. Facial parts are cut from the image and, using the head's orientation, warped into bitmaps with 'normalized' orientation and scale.

To synthesize natural-looking heads, not only the static appearance of a face but also the full dynamics of the facial deformations have to be captured and rendered with high precision. By translating all facial parts into a normalized view, we can describe their dynamics with a few parameters. For example, we record the normalized parameters of the lip shape for diphones and the most common triphones. Such sample-based co-articulation produces more natural-looking synthesized speech than model-based co-articulation.
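The abstract does not spell out how the motion, color, and shape cues are combined for the coarse localization step. The following is a minimal sketch, assuming RGB frames as NumPy arrays; the frame-difference threshold and the red-chromaticity skin test are illustrative guesses, not the authors' method.

```python
import numpy as np

def coarse_head_box(frame_rgb, prev_rgb, motion_thresh=15,
                    r_min=0.35, r_max=0.55):
    """Combine a frame-difference motion cue with a crude skin-color
    cue into a binary mask; the mask's bounding box serves as a
    coarse head location. All thresholds are illustrative."""
    # Motion cue: pixels whose gray value changed between frames.
    gray = frame_rgb.mean(axis=2)
    prev = prev_rgb.mean(axis=2)
    motion = np.abs(gray - prev) > motion_thresh

    # Color cue: normalized red chromaticity in a loose skin range.
    rgb_sum = frame_rgb.sum(axis=2) + 1e-6
    r_chroma = frame_rgb[..., 0] / rgb_sum
    skin = (r_chroma > r_min) & (r_chroma < r_max)

    ys, xs = np.nonzero(motion & skin)
    if len(xs) == 0:
        return None  # no moving skin-colored region found
    return (xs.min(), ys.min(), xs.max(), ys.max())
```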
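The matched-filter search over smaller areas can be read as correlating a feature template against a restricted window around the coarse estimate. A sketch using normalized cross-correlation in plain NumPy; the paper does not specify the correlation measure, and `matched_filter_search` is a hypothetical helper name.

```python
import numpy as np

def matched_filter_search(image, template, center, radius):
    """Scan a (2*radius+1)^2 window around `center` with normalized
    cross-correlation; return the best-matching (x, y) and score."""
    th, tw = template.shape
    t = template - template.mean()
    t_norm = np.sqrt((t * t).sum()) + 1e-9
    cx, cy = center
    best_score, best_pos = -1.0, center
    for y in range(cy - radius, cy + radius + 1):
        for x in range(cx - radius, cx + radius + 1):
            y0, x0 = y - th // 2, x - tw // 2
            if (y0 < 0 or x0 < 0 or
                    y0 + th > image.shape[0] or x0 + tw > image.shape[1]):
                continue  # window falls off the image edge
            patch = image[y0:y0 + th, x0:x0 + tw]
            p = patch - patch.mean()
            p_norm = np.sqrt((p * p).sum()) + 1e-9
            score = (p * t).sum() / (p_norm * t_norm)
            if score > best_score:
                best_score, best_pos = score, (x, y)
    return best_pos, best_score
```

Searching only a small window around the coarse estimate keeps the cost of the exhaustive scan low while retaining the precision of the template match.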
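For warping facial parts into bitmaps of 'normalized' orientation and scale, here is a sketch assuming OpenCV. It handles only in-plane rotation and scale derived from two eye positions, an affine approximation; the paper computes a full 3D head orientation before warping.

```python
import cv2
import numpy as np

def normalize_part(frame, eye_left, eye_right,
                   out_size=(64, 32), target_eye_dist=40.0):
    """Rotate and scale the frame so the eye axis is horizontal and
    the inter-ocular distance is fixed, then crop a bitmap around
    the eye midpoint. Sizes and distances are illustrative."""
    (x1, y1), (x2, y2) = eye_left, eye_right
    angle = np.degrees(np.arctan2(y2 - y1, x2 - x1))  # in-plane roll
    scale = target_eye_dist / (np.hypot(x2 - x1, y2 - y1) + 1e-9)
    center = ((x1 + x2) / 2.0, (y1 + y2) / 2.0)

    M = cv2.getRotationMatrix2D(center, angle, scale)
    warped = cv2.warpAffine(frame, M, (frame.shape[1], frame.shape[0]))

    # Crop the normalized bitmap around the (unchanged) eye midpoint.
    w, h = out_size
    cx, cy = int(center[0]), int(center[1])
    return warped[cy - h // 2: cy + h // 2, cx - w // 2: cx + w // 2]
```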
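Sample-based co-articulation amounts to looking up recorded lip-parameter trajectories by phonetic context. A sketch with hypothetical tables, preferring a triphone sample and falling back to the two overlapping diphones; the parameter tuples and the naive concatenation are placeholders for the recorded, blended trajectories.

```python
# Hypothetical tables of normalized lip-shape parameter trajectories,
# e.g. (width, opening) pairs per frame, keyed by phoneme context.
TRIPHONES = {
    ("m", "a", "t"): [(0.2, 0.1), (0.6, 0.5), (0.4, 0.2)],
}
DIPHONES = {
    ("m", "a"): [(0.2, 0.1), (0.6, 0.5)],
    ("a", "t"): [(0.6, 0.5), (0.4, 0.2)],
    ("b", "a"): [(0.3, 0.2), (0.6, 0.5)],
}

def lip_trajectory(prev, cur, nxt):
    """Prefer a recorded triphone sample; fall back to concatenating
    the two overlapping diphones when the triphone is missing."""
    tri = TRIPHONES.get((prev, cur, nxt))
    if tri is not None:
        return tri
    left = DIPHONES.get((prev, cur), [])
    right = DIPHONES.get((cur, nxt), [])
    # Naive join on the shared phoneme; a real system would blend.
    return left + right[1:]

print(lip_trajectory("m", "a", "t"))  # triphone hit
print(lip_trajectory("b", "a", "t"))  # diphone fallback
```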