This paper addresses the problem of audio-visual command recognition in the framework of the D-META Grand Challenge. Temporal and non-temporal learning models are trained on visual and auditory descriptors. To set a proper baseline, the methods are tested on the "Robot Gestures" scenario of the publicly available RAVEL data set, following a leave-one-out cross-validation strategy. The classification-level audio-visual fusion strategy compensates for the errors of the unimodal (audio-only or vision-only) classifiers. The obtained results (an average audio-visual recognition rate of almost 80%) encourage further development and improvement of the methodology described in this paper.
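The classification-level (late) fusion mentioned above can be illustrated with a minimal sketch: each unimodal classifier emits per-class scores, and the fused decision is taken on a weighted combination of those scores, so a confident modality can outvote an erroneous one. The function name, the convex-sum rule, and the weight `alpha` are illustrative assumptions, not the paper's exact fusion scheme.

```python
import numpy as np

def late_fusion(audio_scores, visual_scores, alpha=0.5):
    """Classification-level fusion sketch (assumed convex weighted sum).

    audio_scores, visual_scores: per-class scores from the two
    unimodal classifiers; alpha weights the audio modality.
    Returns the fused class index and the fused score vector.
    """
    audio_scores = np.asarray(audio_scores, dtype=float)
    visual_scores = np.asarray(visual_scores, dtype=float)
    # Convex combination of the two score vectors, then argmax.
    fused = alpha * audio_scores + (1.0 - alpha) * visual_scores
    return int(np.argmax(fused)), fused

# Example: the audio classifier errs (picks class 0), but the more
# confident visual classifier pulls the fused decision to class 1.
label, fused = late_fusion([0.6, 0.4, 0.0], [0.1, 0.8, 0.1])
print(label)  # -> 1
```

This illustrates how fusion at the classification level can compensate for unimodal errors: neither modality's scores are discarded, but the more decisive one dominates the combined posterior.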