This paper addresses the problem of audio-visual command recognition in the framework of the D-META Grand Challenge. Temporal and non-temporal learning models are trained on visual and auditory descriptors. To set a proper baseline, the methods are tested on the "Robot Gestures" scenario of the publicly available RAVEL data set, following a leave-one-out cross-validation strategy. The classification-level audio-visual fusion strategy compensates for the errors of the unimodal (audio-only or vision-only) classifiers. The obtained results (an average audio-visual recognition rate of almost 80%) encourage further development and improvement of the methodology described in this paper.
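The classification-level (late) fusion mentioned above can be illustrated with a minimal sketch: each unimodal classifier emits per-class scores, and the fused decision is taken on a weighted combination of those scores, so a confident modality can outvote an erroneous one. The function name, the convex-sum rule, and the weight `alpha` are illustrative assumptions, not the paper's exact fusion scheme.

```python
import numpy as np

def late_fusion(audio_scores, visual_scores, alpha=0.5):
    """Classification-level fusion sketch (assumed convex weighted sum).

    audio_scores, visual_scores: per-class scores from the two
    unimodal classifiers; alpha weights the audio modality.
    Returns the fused class index and the fused score vector.
    """
    audio_scores = np.asarray(audio_scores, dtype=float)
    visual_scores = np.asarray(visual_scores, dtype=float)
    # Convex combination of the two score vectors, then argmax.
    fused = alpha * audio_scores + (1.0 - alpha) * visual_scores
    return int(np.argmax(fused)), fused

# Example: the audio classifier errs (picks class 0), but the more
# confident visual classifier pulls the fused decision to class 1.
label, fused = late_fusion([0.6, 0.4, 0.0], [0.1, 0.8, 0.1])
print(label)  # -> 1
```

This illustrates how fusion at the classification level can compensate for unimodal errors: neither modality's scores are discarded, but the more decisive one dominates the combined posterior.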