Toward multimodal fusion of affective cues
Proceedings of the 1st ACM international workshop on Human-centered multimedia
Hand gestures and speech comprise the most important modalities of human-to-human interaction. Motivated by this, there has been considerable interest in incorporating these modalities into "natural" human-computer interaction (HCI), particularly within virtual environments. An important feature of such a natural interface would be the absence of predefined speech and gesture commands. The resulting bimodal speech/gesture HCI "language" would thus have to be interpreted by the computer. This involves challenges ranging from the low-level signal processing of bimodal (audio/video) input to the high-level interpretation of natural speech/gesture in HCI. This paper identifies the issues of natural (non-predefined) multimodal HCI interpretation. Since, in natural interaction, gestures do not exhibit a one-to-one mapping of form to meaning, we specifically address problems associated with vision-based gesture interpretation in a multimodal interface. We consider the design of a speech/gesture interface in the context of a set of spatial tasks defined on a computerized campus map. The task context makes it possible to study the critical components of the multimodal interpretation and integration problem.