Human interaction categorization by using audio-visual cues

  • Authors:
  • M. J. Marín-Jiménez; R. Muñoz-Salinas; E. Yeguas-Bolivar; N. Pérez de la Blanca

  • Affiliations:
  • Department of Computing and Numerical Analysis, Maimonides Institute for Biomedical Research (IMIBIC), University of Córdoba, Córdoba, Spain 14071 (M. J. Marín-Jiménez, R. Muñoz-Salinas, E. Yeguas-Bolivar); Department of Computer Science and Artificial Intelligence, University of Granada, Granada, Spain 18071 (N. Pérez de la Blanca)

  • Venue:
  • Machine Vision and Applications
  • Year:
  • 2014

Abstract

Human Interaction Recognition (HIR) in uncontrolled TV video material is a very challenging problem because of the huge intra-class variability of the classes (due to large differences in the way actions are performed, lighting conditions and camera viewpoints, amongst others) as well as the small inter-class variability (e.g., the visual difference between hug and kiss is very subtle). Most previous works have focused only on visual information (i.e., the image signal), thus missing an important source of information present in human interactions: the audio. So far, such approaches have not proven discriminative enough. This work proposes the use of an Audio-Visual Bag of Words (AVBOW) as a more powerful mechanism for approaching the HIR problem than the traditional Visual Bag of Words (VBOW). We show in this paper that the combined use of video and audio information yields better classification results than video alone. Our approach has been validated on the challenging TVHID dataset, showing that the proposed AVBOW provides statistically significant improvements over the VBOW employed in the related literature.
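To make the AVBOW idea concrete, the sketch below shows one plausible instantiation: per-modality codebooks are learned by k-means over local visual and audio descriptors, each video is encoded as an L1-normalized word histogram per modality, and the histograms are concatenated before training a standard SVM. This is only an illustration of the general audio-visual bag-of-words recipe under assumed settings (descriptor dimensions, vocabulary sizes, synthetic data, RBF kernel); it is not the authors' exact pipeline or their feature choices.

```python
# Minimal audio-visual bag-of-words (AVBOW) sketch, assuming precomputed
# local visual descriptors (e.g., spatio-temporal features) and audio
# descriptors (e.g., MFCC frames) per video. All shapes, vocabulary
# sizes, and the synthetic data are illustrative assumptions.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.svm import SVC

rng = np.random.default_rng(0)

def bow_histogram(descriptors, codebook):
    """Quantize local descriptors against a codebook and return an
    L1-normalized histogram of visual/audio word counts."""
    words = codebook.predict(descriptors)
    hist = np.bincount(words, minlength=codebook.n_clusters).astype(float)
    return hist / max(hist.sum(), 1.0)

# Synthetic stand-ins: 40 videos, each with local visual (dim 64)
# and audio (dim 13) descriptors, plus a binary interaction label.
videos_visual = [rng.normal(size=(200, 64)) for _ in range(40)]
videos_audio  = [rng.normal(size=(100, 13)) for _ in range(40)]
labels = rng.integers(0, 2, size=40)

# Learn one codebook per modality from the pooled descriptors.
visual_codebook = KMeans(n_clusters=50, n_init=4, random_state=0).fit(
    np.vstack(videos_visual))
audio_codebook = KMeans(n_clusters=20, n_init=4, random_state=0).fit(
    np.vstack(videos_audio))

# AVBOW: concatenate the per-modality histograms into one representation.
X = np.array([
    np.concatenate([bow_histogram(v, visual_codebook),
                    bow_histogram(a, audio_codebook)])
    for v, a in zip(videos_visual, videos_audio)
])

clf = SVC(kernel="rbf").fit(X, labels)  # video-level interaction classifier
print(clf.predict(X[:5]))
```

Concatenating normalized histograms is a simple early-fusion choice; training separate classifiers per modality and combining their scores (late fusion) is an equally common alternative in this line of work.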