Technical Section: Facial animation based on context-dependent visemes

  • Authors: José Mario De Martino, Léo Pini Magalhães, Fábio Violaro

  • Affiliations: Department of Computer Engineering and Industrial Automation, School of Electrical and Computer Engineering, State University of Campinas, 13083-970 - Av. Albert Einstein, 400, Campinas, SP, Brazil; Department of Communications, School of Electrical and Computer Engineering, State University of Campinas, 13083-970 - Av. Albert Einstein, 400, Campinas, SP, Brazil

  • Venue: Computers and Graphics
  • Year: 2006

Abstract

This paper presents a novel approach for generating realistic, speech-synchronized 3D facial animation that copes with anticipatory and perseveratory coarticulation. The methodology is based on the measurement of 3D trajectories of fiduciary points marked on the face of a real speaker during the production of CVCV nonsense words. The trajectories are measured from standard video sequences using stereo-vision photogrammetric techniques. The first stationary point of each trajectory associated with a phonetic segment is selected as its articulatory target. By clustering, according to geometric similarity, all articulatory targets of the same segment in different phonetic contexts, a set of phonetic context-dependent visemes accounting for coarticulation is identified. These visemes then drive a set of geometric transformation/deformation models that reproduce, on the 3D virtual face, the rotation and translation of the temporomandibular joint, as well as the lip behavior of natural articulation, such as protrusion and the width and height of the mouth opening. The approach is being used to generate speech-synchronized 3D animation from both natural speech and synthetic speech produced by a text-to-speech synthesizer.
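The two central steps of the abstract's pipeline, picking the first stationary point of a trajectory as the articulatory target and clustering targets by geometric similarity into context-dependent visemes, can be sketched as follows. This is an illustrative reconstruction, not the paper's implementation: the speed threshold `eps`, the greedy nearest-centroid clustering, and the `radius` parameter are all assumptions; the abstract states that clustering is done by geometric similarity but does not specify the algorithm or its parameters.

```python
import math

def first_stationary_point(traj, eps=0.05):
    """Return the first sample of a 3D point trajectory where the
    displacement between consecutive video frames drops below `eps`,
    i.e. the point has (nearly) stopped moving.

    traj: list of (x, y, z) tuples sampled at the video frame rate.
    eps:  assumed stationarity threshold, in the same units as traj.
    """
    for i in range(1, len(traj)):
        if math.dist(traj[i], traj[i - 1]) < eps:
            return traj[i]
    return traj[-1]  # no stationary point found: fall back to last sample

def cluster_targets(targets, radius=0.3):
    """Greedy geometric clustering of articulatory targets: each target
    joins the first cluster whose centroid lies within `radius`,
    otherwise it starts a new cluster. Each resulting centroid stands
    in for one context-dependent viseme.

    targets: list of (x, y, z) articulatory targets of one phonetic
             segment, collected across different phonetic contexts.
    """
    clusters = []  # each entry is [centroid, member_list]
    for t in targets:
        for c in clusters:
            if math.dist(t, c[0]) < radius:
                c[1].append(t)
                # recompute the centroid as the mean of the members
                c[0] = tuple(sum(p[k] for p in c[1]) / len(c[1])
                             for k in range(3))
                break
        else:
            clusters.append([t, [t]])
    return [c[0] for c in clusters]
```

In this sketch, a segment whose targets fall into two well-separated groups yields two visemes (e.g. one for rounded-vowel contexts, one for spread-vowel contexts), which is exactly the context dependence the paper exploits.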