Predicting synthetic voice style from facial expressions. An application for augmented conversations

  • Authors:
  • Éva Székely; Zeeshan Ahmed; Shannon Hennig; João P. Cabral; Julie Carson-Berndsen

  • Venue:
  • Speech Communication
  • Year:
  • 2014

Abstract

The ability to efficiently facilitate social interaction and emotional expression is an important, yet unmet requirement for speech generating devices aimed at individuals with speech impairment. Using gestures such as facial expressions to control aspects of expressive synthetic speech could contribute to an improved communication experience for both the user of the device and the conversation partner. For this purpose, a mapping model between facial expressions and speech is needed that is high-level (utterance-based), versatile and personalisable. In the mapping developed in this work, the visual and auditory modalities are connected through the intended emotional salience of a message: the intensity of the user's facial expressions is mapped to the emotional intensity of the synthetic speech. The mapping model has been implemented in a system called WinkTalk, which uses estimated facial expression categories and their intensity values to automatically select between three expressive synthetic voices reflecting three degrees of emotional intensity. An evaluation was conducted through an interactive experiment using simulated augmented conversations. The results show that automatic control of synthetic speech through facial expressions is fast, non-intrusive, sufficiently accurate, and helps the user feel more involved in the conversation. It can be concluded that the system has the potential to facilitate a more efficient communication process between user and listener.
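
To make the described mapping concrete, the following is a minimal sketch of how utterance-level voice selection from estimated facial expression intensity could work. The function name, the intensity thresholds, and the voice labels are illustrative assumptions, not the authors' published WinkTalk implementation.

```python
# Hypothetical sketch of intensity-based voice selection, inspired by the
# WinkTalk description above. Thresholds, labels and the input format are
# assumptions for illustration, not the published system.

from dataclasses import dataclass

# Three expressive synthetic voices reflecting three degrees of emotional
# intensity, as described in the abstract (the labels here are assumed).
VOICES = ("calm", "moderate", "intense")


@dataclass
class FacialExpressionEstimate:
    category: str      # e.g. "smile", "frown" (category names are assumed)
    intensity: float   # estimated expression intensity in [0.0, 1.0]


def select_voice(estimate: FacialExpressionEstimate,
                 low: float = 0.33, high: float = 0.66) -> str:
    """Map the estimated facial expression intensity of an utterance to
    one of three expressive synthetic voices (thresholds are assumed)."""
    if estimate.intensity < low:
        return VOICES[0]
    if estimate.intensity < high:
        return VOICES[1]
    return VOICES[2]


if __name__ == "__main__":
    # Example: a strongly expressed facial expression selects the most
    # emotionally intense voice for the next synthesized utterance.
    print(select_voice(FacialExpressionEstimate("smile", 0.8)))  # -> "intense"
```

In this sketch the decision is made once per utterance, matching the abstract's emphasis on a high-level (utterance-based) mapping rather than frame-by-frame prosody control.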