Many user interfaces, from graphic design programs to navigation aids in cars, share a virtual space with the user. Such applications are often ideal candidates for speech interfaces that allow the user to refer to objects in the shared space. We present an analysis of how people describe objects in spatial scenes using natural language. Based on this study, we describe a system that uses synthetic vision to "see" such scenes from the person's point of view, and that understands complex natural language descriptions referring to objects in the scenes. This system is based on a rich notion of semantic compositionality embedded in a grounded language understanding framework. We describe its semantic elements, their compositional behaviour, and their grounding through the synthetic vision system. To conclude, we evaluate the performance of the system on unconstrained input.
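The compositional, grounded interpretation described above can be illustrated with a minimal sketch. This is not the paper's implementation; all names (`Obj`, `compose`, the word-level scoring functions) are hypothetical. The idea shown is that each word in a description contributes a graded scoring function over candidate objects in the scene, and composition combines those scores so that a phrase like "the green one on the left" picks out a referent:

```python
# Hypothetical sketch (not the system described in the abstract):
# each word grounds out as a score map over scene objects, and
# composition multiplies the maps to resolve a referring expression.
from dataclasses import dataclass

@dataclass(frozen=True)
class Obj:
    color: str
    x: float  # 0 = far left, 1 = far right in the viewer's frame

def green(scene):
    # Categorical property: full score for green objects, zero otherwise.
    return {o: 1.0 if o.color == "green" else 0.0 for o in scene}

def leftmost(scene):
    # Graded spatial property: score decreases from left to right.
    xs = sorted(o.x for o in scene)
    span = (xs[-1] - xs[0]) or 1.0
    return {o: 1.0 - (o.x - xs[0]) / span for o in scene}

def compose(*maps):
    # Combine word-level groundings by multiplying their scores.
    out = {}
    for o in maps[0]:
        score = 1.0
        for m in maps:
            score *= m[o]
        out[o] = score
    return out

scene = [Obj("green", 0.1), Obj("red", 0.5), Obj("green", 0.9)]
scores = compose(green(scene), leftmost(scene))
referent = max(scores, key=scores.get)
print(referent.color, referent.x)  # → green 0.1
```

The multiplicative combination is one simple choice; it captures the intuition that a referent must satisfy every part of the description to some degree, so "the green one on the left" selects the leftmost green object rather than merely any green or any leftmost object.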