Object category recognition using probabilistic fusion of speech and image classifiers
MLMI'07 Proceedings of the 4th international conference on Machine learning for multimodal interaction
Object recognition is an important part of human-computer interaction in situated environments, such as a home or an office. Especially useful is category-level recognition (e.g., recognizing the class of chairs, as opposed to a particular chair). While humans can employ multimodal cues for categorizing objects during situated conversational interactions, most computer algorithms currently rely on vision-only or speech-only recognition. We are developing a method for learning about physical objects found in a situated environment based on visual and spoken input provided by the user. The algorithm operates on generic databases of labeled object images and transcribed speech data, plus unlabeled audio and images of a user referring to objects in the environment. By exploiting the generic labeled databases, the algorithm would associate probable object-referring words with probable visual representations of those objects, and use both modalities to determine the object label. The first advantage of this approach over vision-only or speech-only recognition is the ability to disambiguate object categories using complementary information sources. The second advantage is that, using the additional unlabeled data gathered during the interaction, the system can potentially improve its recognition of new category instances in the physical environment in which it is situated, as well as of new utterances spoken by the same user, compared to a system that uses only the generic labeled databases. It can achieve this by adapting its generic object classifiers and its generic speech and language models.
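As a toy illustration of the disambiguation idea (not the paper's actual model), the posteriors produced by independent vision and speech classifiers over a shared category set can be fused with a naive-Bayes product rule. All category names and probability values below are hypothetical.

```python
def fuse_posteriors(p_vision, p_speech):
    """Naive-Bayes fusion: multiply per-category posteriors from the
    two modalities element-wise, then renormalize to sum to 1.
    Assumes both lists index the same category set."""
    fused = [pv * ps for pv, ps in zip(p_vision, p_speech)]
    total = sum(fused)
    return [f / total for f in fused]

# Hypothetical example: vision alone is ambiguous between two categories,
# but the speech posterior (e.g., from an object-referring word) resolves it.
categories = ["chair", "mug", "phone"]
p_vision = [0.45, 0.40, 0.15]
p_speech = [0.20, 0.70, 0.10]

fused = fuse_posteriors(p_vision, p_speech)
best = categories[max(range(len(fused)), key=fused.__getitem__)]
print(best)  # the fused posterior favors "mug"
```

The product rule is equivalent to summing log-probabilities, which is how such fusion is usually implemented in practice to avoid numerical underflow with many categories.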