Exploring STIP-based models for recognizing human interactions in TV videos

Authors:
Manuel J. Marín-Jiménez;Enrique Yeguas;Nicolás Pérez De La Blanca
Affiliations:
Department of Computer Science and Numerical Analysis, University of Córdoba, 14071 Córdoba, Spain;Department of Computer Science and Numerical Analysis, University of Córdoba, 14071 Córdoba, Spain;Department of Computer Science and Artificial Intelligence, University of Granada, 18071 Granada, Spain
Venue:
Pattern Recognition Letters
Year:
2013

Abstract

Human motion recognition in real video data, whether of actions (HAR) or interactions (HIR), is a very challenging task. In the last few years, models of increasing complexity have been proposed to improve performance on this task. However, it remains unclear whether it is the features or the models that warrant the increase in complexity. In this paper, we evaluate this question for the HIR task. To that end, we compare the results of our experiments, which use STIP-based features and BOW models as a basis combined with a standard classifier, with some of the most effective recent approaches that use alternative representation models. We perform a comprehensive experimental study on two state-of-the-art HIR databases, TV Human interactions and UT-interactions, and compare our results with recent results published on these datasets. In addition, we run cross-dataset experiments on the Hollywood-2 dataset to study how well the trained models generalize across datasets. The most relevant result is that the model combining STIP+BOW is competitive on the HIR task with the most complex ones. We also show that the vocabulary learning subtask can be improved by applying compression algorithms to a large enough initial set of features. In contrast to other categorization tasks, context does not help: the results show that dense sampling of STIP is the best choice, but only when it is applied inside the region of interest of the interaction.
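As an illustration of the BOW (bag-of-words) representation the abstract refers to, the following minimal sketch quantizes a set of local descriptors against a visual vocabulary and builds a normalized histogram. It is not the authors' implementation: the function names, the toy 2-D descriptors, the Euclidean nearest-codeword assignment, and the L1 normalization are all assumptions made for the example. The vocabulary is taken as given here, whereas in practice it would be learned from training descriptors (for instance by clustering), and the resulting histogram would be passed to a standard classifier.

```python
from typing import List, Sequence

Vector = Sequence[float]


def squared_distance(a: Vector, b: Vector) -> float:
    """Squared Euclidean distance between two descriptors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest_codeword(descriptor: Vector, vocabulary: Sequence[Vector]) -> int:
    """Index of the vocabulary entry closest to the descriptor."""
    return min(
        range(len(vocabulary)),
        key=lambda k: squared_distance(descriptor, vocabulary[k]),
    )


def bow_histogram(
    descriptors: Sequence[Vector], vocabulary: Sequence[Vector]
) -> List[float]:
    """L1-normalized histogram of codeword occurrences for one video clip."""
    counts = [0] * len(vocabulary)
    for descriptor in descriptors:
        counts[nearest_codeword(descriptor, vocabulary)] += 1
    total = sum(counts)
    if total == 0:
        # No descriptors were extracted: return an all-zero histogram.
        return [0.0] * len(vocabulary)
    return [count / total for count in counts]


# Toy data (hypothetical): three codewords and four descriptors in 2-D.
# Real STIP descriptors are much higher-dimensional.
vocabulary = [(0.0, 0.0), (1.0, 1.0), (5.0, 5.0)]
descriptors = [(0.1, -0.1), (0.9, 1.2), (1.1, 0.8), (4.7, 5.2)]

histogram = bow_histogram(descriptors, vocabulary)
print(histogram)
```

In this toy case, one descriptor falls nearest the first codeword, two nearest the second, and one nearest the third. Restricting `descriptors` to those sampled inside the region of interest of the interaction would correspond to the setting the abstract reports as the best choice.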