Object category recognition using probabilistic fusion of speech and image classifiers

  • Authors:
  • Kate Saenko; Trevor Darrell

  • Affiliations:
  • Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Cambridge, MA (both authors)

  • Venue:
  • MLMI'07: Proceedings of the 4th International Conference on Machine Learning for Multimodal Interaction
  • Year:
  • 2007

Abstract

Multimodal scene understanding is an integral part of human-robot interaction (HRI) in situated environments. Especially useful is category-level recognition, where the system can recognize classes of objects or scenes rather than specific instances (e.g., any chair vs. this particular chair). Humans use multiple modalities to understand which object category is being referred to, simultaneously interpreting gesture, speech, and visual appearance, and using one modality to disambiguate the information contained in the others. In this paper, we address the problem of fusing visual and acoustic information to predict object categories when an image of the object and speech input from the user are available to the HRI system. Using probabilistic decision fusion, we show improved classification rates on a dataset containing a wide variety of object categories, compared to using either modality alone.
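
The abstract does not spell out the exact fusion rule, so the following is only a rough illustrative sketch of one common form of probabilistic decision fusion: a weighted product of the per-class posteriors produced by the image and speech classifiers. The function name, weights, and toy class set are hypothetical and not taken from the paper.

```python
import numpy as np

def fuse_posteriors(p_image, p_speech, w_image=0.5, w_speech=0.5):
    """Weighted log-linear (product-rule) fusion of two per-class posteriors.

    p_image, p_speech: 1-D arrays of class posteriors from the image and
    speech classifiers, in the same class order. The weights are hypothetical
    tuning parameters, not values reported in the paper.
    """
    eps = 1e-12  # guard against log(0)
    log_fused = w_image * np.log(p_image + eps) + w_speech * np.log(p_speech + eps)
    fused = np.exp(log_fused - log_fused.max())  # subtract max for numerical stability
    return fused / fused.sum()                   # renormalize to a distribution

# Toy usage with three made-up object categories, e.g. {chair, table, mug}:
p_img = np.array([0.5, 0.3, 0.2])    # image classifier is uncertain
p_sp  = np.array([0.8, 0.1, 0.1])    # speech classifier strongly favors class 0
fused = fuse_posteriors(p_img, p_sp)
print(fused, fused.argmax())         # fused posterior; argmax gives the predicted category
```

The intuition matches the abstract's claim: when one modality is ambiguous (the image posterior above), the other modality's evidence shifts the fused decision, which is how combining the two can outperform either classifier alone.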