Control of speech-related facial movements of an avatar from video

  • Authors:
  • Guillaume Gibert; Yvonne Leung; Catherine J. Stevens

  • Affiliations:
  • INSERM U846, 18 avenue Doyen Lépine, 69500 Bron Cedex, France; Stem Cell and Brain Research Institute, 69500 Bron Cedex, France; Université de Lyon, Université Lyon 1, 69003 L ...
  • MARCS Institute, University of Western Sydney, Locked Bag 1797, Penrith, NSW 2751, Australia
  • MARCS Institute, University of Western Sydney, Locked Bag 1797, Penrith, NSW 2751, Australia; School of Social Sciences & Psychology, University of Western Sydney, Locked Bag 1797, Penrith, NSW ...

  • Venue:
  • Speech Communication
  • Year:
  • 2013

Abstract

Several puppetry techniques have recently been proposed to transfer emotional facial expressions from a user's video to an avatar. Whereas the generation of facial expressions may not be sensitive to small tracking errors, the generation of speech-related facial movements would be severely impaired by them. Since incongruent facial movements can drastically influence speech perception, we proposed a more effective method to transfer speech-related facial movements from a user to an avatar. After a facial tracking phase, speech articulatory parameters (controlling the jaw and the lips) were determined from the set of landmark positions. Two additional processes calculated the articulatory parameters controlling the eyelids and the tongue from the 2D Discrete Cosine Transform (DCT) coefficients of the eye and inner-mouth images. A speech-in-noise perception experiment was conducted with 25 participants to evaluate the system. An increase in intelligibility was shown for the avatar and human auditory-visual conditions compared with their respective auditory-only conditions. The results of the avatar auditory-visual presentation differed with vocalic context: all consonants were better perceived in the /a/ vocalic context than in /i/ and /u/, because of the lack of depth information retrievable from video. This method could be used to accurately animate avatars for hearing-impaired people using information and telecommunication technologies.
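The eyelid and tongue parameters are described as being derived from 2D DCT coefficients of the eye and inner-mouth image regions. The sketch below illustrates that feature-extraction step only; it is not the authors' implementation, and the patch size, the 8×8 low-frequency block, and the function name `dct_features` are illustrative assumptions.

```python
import numpy as np
from scipy.fft import dctn


def dct_features(region, n_coeffs=8):
    """Return low-frequency 2D DCT coefficients of a grayscale image patch.

    region   : 2D numpy array, e.g. a cropped eye or inner-mouth region
    n_coeffs : side length of the low-frequency block kept as features
               (8 is an illustrative choice, not the paper's value)
    """
    # Full 2D type-II DCT with orthonormal scaling
    coeffs = dctn(region.astype(np.float64), norm="ortho")
    # Keep the top-left (low-frequency) block; these coefficients capture
    # the coarse appearance of the patch (e.g. open vs. closed eye or mouth)
    block = coeffs[:n_coeffs, :n_coeffs]
    return block.flatten()


if __name__ == "__main__":
    # Synthetic 32x32 patch standing in for a cropped eye image
    rng = np.random.default_rng(0)
    patch = rng.random((32, 32))
    features = dct_features(patch)
    print(features.shape)  # (64,) vector to be mapped to an eyelid/tongue parameter
```

In such a pipeline, the resulting coefficient vector would typically be mapped to the eyelid or tongue articulatory parameter by a regression or lookup model trained offline; the paper's abstract does not specify which mapping is used.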